You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@kylin.apache.org by liang <pw...@gmail.com> on 2019/12/19 04:24:42 UTC

ExponentialBackoffRetry implemented a incorrect retry policy?

Hi there,

Today we had encountered a build error. The step "Build N-Dimension Cuboid
: level 1" failed caused by WriteConflictException.


> 2019-12-19 03:19:59,049 TRACE [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
> oldTs: 1576696795196, to newTs: 1576696799048, operation result: true
> 2019-12-19 03:20:02,775 INFO  [BadQueryDetector]
> service.BadQueryDetector:147 : Detect bad query.
> 2019-12-19 03:20:03,254 TRACE [Scheduler 1528161485 Job
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
> oldTs: 1576696793142, to newTs: 1576696803253, operation result: true
> 2019-12-19 03:20:03,256 TRACE [Scheduler 1528161485 Job
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
> oldTs: 1576696803253, to newTs: 1576696803255, operation result: true
> 2019-12-19 03:20:04,187 INFO  [FetcherRunner 2062551616-143]
> threadpool.DefaultFetcherRunner:85 : Job Fetcher: 2 should running, 2
> actual running, 0 stopped, 0 ready, 1754 already succeed, 7 error, 10
> discarded, 0 others
> 2019-12-19 03:20:09,124 TRACE [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
> oldTs: 1576696799048, to newTs: 1576696809122, operation result: true
> 2019-12-19 03:20:09,125 TRACE [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
> oldTs: 1576696809122, to newTs: 1576696809124, operation result: true
> 2019-12-19 03:20:13,263 TRACE [Scheduler 1528161485 Job
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
> oldTs: 1576696803255, to newTs: 1576696813262, operation result: true
> 2019-12-19 03:20:24,133 INFO  [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879]
> persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1
> seconds.
> 2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
> oldTs: 1576696809124, to newTs: 1576696819131, operation result: false
> 2019-12-19 03:20:25,284 TRACE [Scheduler 1528161485 Job
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
> oldTs: 1576696813262, to newTs: 1576696825245, operation result: true
> 2019-12-19 03:20:25,299 ERROR [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 :
> error execute
> MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, name=Build
> N-Dimension Cuboid : level 1, state=RUNNING}
> org.apache.kylin.common.persistence.WriteConflictException: Overwriting
> conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, expect
> old TS 1576696809124, but it is 1576696819131
> at
> org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
> at
> org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423)
> at
> org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
> at
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423)
> at
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411)
> at
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405)
> at
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388)
> at
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368)
> at
> org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252)
> at
> org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426)
> at
> org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566)
> at
> org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163)
> at
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
> at
> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
> at
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
> at
> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>


At 03:20:24,133, the thread "Scheduler 1528161485 Job
4759f195-9443-2604-5358-d5804d9aaefc-879" is trying to write content to the
path "/execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08" but the
exception was raised for some reasons.

ExponentialBackoffRetry will retry if the following exceptions happened:
SocketTimeoutException, or ConnectException, or RetriesExhaustedException
or any other exceptions configured in
"kylin.resourcestore.connection-exceptions"

By default the retry policy is disabled, but we have enabled it by setting
"kylin.resourcestore.reconnect-enabled". Because we have a heavy loading
HBase cluster.

Because failed to call checkAndPutResourceImpl, the thread "Scheduler
1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879" sleeps 1 second to
back-off.


> 2019-12-19 03:20:09,125 TRACE [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
> oldTs: 1576696809122, to newTs: 1576696809124, operation result: true
> 2019-12-19 03:20:13,263 TRACE [Scheduler 1528161485 Job
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
> oldTs: 1576696803255, to newTs: 1576696813262, operation result: true
> 2019-12-19 03:20:24,133 INFO  [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879]
> persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1
> seconds.
> 2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
> oldTs: 1576696809124, to newTs: 1576696819131, operation result: false
> 2019-12-19 03:20:25,284 TRACE [Scheduler 1528161485 Job
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
> oldTs: 1576696813262, to newTs: 1576696825245, operation result: true
> 2019-12-19 03:20:25,299 ERROR [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 :
> error execute
> MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, name=Build
> N-Dimension Cuboid : level 1, state=RUNNING}
> org.apache.kylin.common.persistence.WriteConflictException: Overwriting
> conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, expect
> old TS 1576696809124, but it is 1576696819131
> at
> org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
>


Here is the guess of what's happended at that time:

[image: Screen Shot 2019-12-19 at 12.16.05.png]
Thus, we think ExponentialBackoffRetry implemented an incorrect retry
policy. Determining whether to retry only by the type of exception is not
reliable. At least, we should pull the last modified time to verify the
written is successful or not. Right?

Re: ExponentialBackoffRetry implemented a incorrect retry policy?

Posted by nichunen <ni...@apache.org>.
Hi Liang,


Thanks very much, looking forward to your fix.



Best regards,

 

Ni Chunen / George



On 12/19/2019 15:42,liang<pw...@gmail.com> wrote:
Hi Chunen,



I've created a Jira issue: https://issues.apache.org/jira/browse/KYLIN-4307. We are trying to fix the issue. And the PR will be created later.


Regards,
Liang


On Thu, Dec 19, 2019 at 2:52 PM nichunen <ni...@apache.org> wrote:

Hi Liang,


I have closely read your analysis for this case, and I agree with you. 


Would you please open a Jira issue? What’s more, pull request for this is welcome.



Best regards,

 

Ni Chunen / George



On 12/19/2019 12:24,liang<pw...@gmail.com> wrote:
Hi there,



Today we had encountered a build error. The step "Build N-Dimension Cuboid : level 1" failed caused by WriteConflictException.




2019-12-19 03:19:59,049 TRACE [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 1576696795196, to newTs: 1576696799048, operation result: true
2019-12-19 03:20:02,775 INFO  [BadQueryDetector] service.BadQueryDetector:147 : Detect bad query.
2019-12-19 03:20:03,254 TRACE [Scheduler 1528161485 Job d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 1576696793142, to newTs: 1576696803253, operation result: true
2019-12-19 03:20:03,256 TRACE [Scheduler 1528161485 Job d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 1576696803253, to newTs: 1576696803255, operation result: true
2019-12-19 03:20:04,187 INFO  [FetcherRunner 2062551616-143] threadpool.DefaultFetcherRunner:85 : Job Fetcher: 2 should running, 2 actual running, 0 stopped, 0 ready, 1754 already succeed, 7 error, 10 discarded, 0 others
2019-12-19 03:20:09,124 TRACE [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 1576696799048, to newTs: 1576696809122, operation result: true
2019-12-19 03:20:09,125 TRACE [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 1576696809122, to newTs: 1576696809124, operation result: true
2019-12-19 03:20:13,263 TRACE [Scheduler 1528161485 Job d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 1576696803255, to newTs: 1576696813262, operation result: true
2019-12-19 03:20:24,133 INFO  [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1 seconds.
2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 1576696809124, to newTs: 1576696819131, operation result: false
2019-12-19 03:20:25,284 TRACE [Scheduler 1528161485 Job d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 1576696813262, to newTs: 1576696825245, operation result: true
2019-12-19 03:20:25,299 ERROR [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 : error execute MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, name=Build N-Dimension Cuboid : level 1, state=RUNNING}
org.apache.kylin.common.persistence.WriteConflictException: Overwriting conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, expect old TS 1576696809124, but it is 1576696819131
at org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
at org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423)
at org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
at org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423)
at org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411)
at org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405)
at org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388)
at org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368)
at org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252)
at org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426)
at org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566)
at org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163)
at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)




At 03:20:24,133, the thread "Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879" is trying to write content to the path "/execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08" but the exception was raised for some reasons.

ExponentialBackoffRetry will retry if the following exceptions happened: SocketTimeoutException, or ConnectException, or RetriesExhaustedException or any other exceptions configured in "kylin.resourcestore.connection-exceptions"

By default the retry policy is disabled, but we have enabled it by setting "kylin.resourcestore.reconnect-enabled". Because we have a heavy loading HBase cluster.

Because failed to call checkAndPutResourceImpl, the thread "Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879" sleeps 1 second to back-off.



2019-12-19 03:20:09,125 TRACE [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 1576696809122, to newTs: 1576696809124, operation result: true
2019-12-19 03:20:13,263 TRACE [Scheduler 1528161485 Job d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 1576696803255, to newTs: 1576696813262, operation result: true
2019-12-19 03:20:24,133 INFO  [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1 seconds.
2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 1576696809124, to newTs: 1576696819131, operation result: false
2019-12-19 03:20:25,284 TRACE [Scheduler 1528161485 Job d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 1576696813262, to newTs: 1576696825245, operation result: true
2019-12-19 03:20:25,299 ERROR [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 : error execute MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, name=Build N-Dimension Cuboid : level 1, state=RUNNING}
org.apache.kylin.common.persistence.WriteConflictException: Overwriting conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, expect old TS 1576696809124, but it is 1576696819131
at org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)





Here is the guess of what's happended at that time:




Thus, we think ExponentialBackoffRetry implemented an incorrect retry policy. Determining whether to retry only by the type of exception is not reliable. At least, we should pull the last modified time to verify the written is successful or not. Right?

Re: ExponentialBackoffRetry implemented a incorrect retry policy?

Posted by liang <pw...@gmail.com>.
Hi Chunen,

I've created a Jira issue: https://issues.apache.org/jira/browse/KYLIN-4307.
We are trying to fix the issue. And the PR will be created later.

Regards,
Liang

On Thu, Dec 19, 2019 at 2:52 PM nichunen <ni...@apache.org> wrote:

> Hi Liang,
>
> I have closely read your analysis for this case, and I agree with you.
>
> Would you please open a Jira issue? What’s more, pull request for this is
> welcome.
>
> Best regards,
>
>
>
> Ni Chunen / George
>
>
> On 12/19/2019 12:24,liang<pw...@gmail.com> <pw...@gmail.com> wrote:
>
> Hi there,
>
> Today we had encountered a build error. The step "Build N-Dimension Cuboid
> : level 1" failed caused by WriteConflictException.
>
>
>> 2019-12-19 03:19:59,049 TRACE [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
>> oldTs: 1576696795196, to newTs: 1576696799048, operation result: true
>> 2019-12-19 03:20:02,775 INFO  [BadQueryDetector]
>> service.BadQueryDetector:147 : Detect bad query.
>> 2019-12-19 03:20:03,254 TRACE [Scheduler 1528161485 Job
>> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
>> oldTs: 1576696793142, to newTs: 1576696803253, operation result: true
>> 2019-12-19 03:20:03,256 TRACE [Scheduler 1528161485 Job
>> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
>> oldTs: 1576696803253, to newTs: 1576696803255, operation result: true
>> 2019-12-19 03:20:04,187 INFO  [FetcherRunner 2062551616-143]
>> threadpool.DefaultFetcherRunner:85 : Job Fetcher: 2 should running, 2
>> actual running, 0 stopped, 0 ready, 1754 already succeed, 7 error, 10
>> discarded, 0 others
>> 2019-12-19 03:20:09,124 TRACE [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
>> oldTs: 1576696799048, to newTs: 1576696809122, operation result: true
>> 2019-12-19 03:20:09,125 TRACE [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
>> oldTs: 1576696809122, to newTs: 1576696809124, operation result: true
>> 2019-12-19 03:20:13,263 TRACE [Scheduler 1528161485 Job
>> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
>> oldTs: 1576696803255, to newTs: 1576696813262, operation result: true
>> 2019-12-19 03:20:24,133 INFO  [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879]
>> persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1
>> seconds.
>> 2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
>> oldTs: 1576696809124, to newTs: 1576696819131, operation result: false
>> 2019-12-19 03:20:25,284 TRACE [Scheduler 1528161485 Job
>> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
>> oldTs: 1576696813262, to newTs: 1576696825245, operation result: true
>> 2019-12-19 03:20:25,299 ERROR [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 :
>> error execute
>> MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, name=Build
>> N-Dimension Cuboid : level 1, state=RUNNING}
>> org.apache.kylin.common.persistence.WriteConflictException: Overwriting
>> conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, expect
>> old TS 1576696809124, but it is 1576696819131
>> at
>> org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
>> at
>> org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423)
>> at
>> org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
>> at
>> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423)
>> at
>> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411)
>> at
>> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405)
>> at
>> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388)
>> at
>> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368)
>> at
>> org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252)
>> at
>> org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426)
>> at
>> org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566)
>> at
>> org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163)
>> at
>> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>> at
>> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
>> at
>> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>> at
>> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>
>
> At 03:20:24,133, the thread "Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879" is trying to write content to the
> path "/execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08" but the
> exception was raised for some reasons.
>
> ExponentialBackoffRetry will retry if the following exceptions happened:
> SocketTimeoutException, or ConnectException, or RetriesExhaustedException
> or any other exceptions configured in
> "kylin.resourcestore.connection-exceptions"
>
> By default the retry policy is disabled, but we have enabled it by setting
> "kylin.resourcestore.reconnect-enabled". Because we have a heavy loading
> HBase cluster.
>
> Because failed to call checkAndPutResourceImpl, the thread "Scheduler
> 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879" sleeps 1 second to
> back-off.
>
>
>> 2019-12-19 03:20:09,125 TRACE [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
>> oldTs: 1576696809122, to newTs: 1576696809124, operation result: true
>> 2019-12-19 03:20:13,263 TRACE [Scheduler 1528161485 Job
>> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
>> oldTs: 1576696803255, to newTs: 1576696813262, operation result: true
>> 2019-12-19 03:20:24,133 INFO  [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879]
>> persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1
>> seconds.
>> 2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
>> oldTs: 1576696809124, to newTs: 1576696819131, operation result: false
>> 2019-12-19 03:20:25,284 TRACE [Scheduler 1528161485 Job
>> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
>> oldTs: 1576696813262, to newTs: 1576696825245, operation result: true
>> 2019-12-19 03:20:25,299 ERROR [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 :
>> error execute
>> MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, name=Build
>> N-Dimension Cuboid : level 1, state=RUNNING}
>> org.apache.kylin.common.persistence.WriteConflictException: Overwriting
>> conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, expect
>> old TS 1576696809124, but it is 1576696819131
>> at
>> org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
>>
>
>
> Here is the guess of what's happended at that time:
>
> [image: Screen Shot 2019-12-19 at 12.16.05.png]
> Thus, we think ExponentialBackoffRetry implemented an incorrect retry
> policy. Determining whether to retry only by the type of exception is not
> reliable. At least, we should pull the last modified time to verify the
> written is successful or not. Right?
>
>

Re:ExponentialBackoffRetry implemented a incorrect retry policy?

Posted by nichunen <ni...@apache.org>.
Hi Liang,


I have closely read your analysis for this case, and I agree with you. 


Would you please open a Jira issue? What’s more, pull request for this is welcome.



Best regards,

 

Ni Chunen / George



On 12/19/2019 12:24,liang<pw...@gmail.com> wrote:
Hi there,



Today we had encountered a build error. The step "Build N-Dimension Cuboid : level 1" failed caused by WriteConflictException.




2019-12-19 03:19:59,049 TRACE [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 1576696795196, to newTs: 1576696799048, operation result: true
2019-12-19 03:20:02,775 INFO  [BadQueryDetector] service.BadQueryDetector:147 : Detect bad query.
2019-12-19 03:20:03,254 TRACE [Scheduler 1528161485 Job d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 1576696793142, to newTs: 1576696803253, operation result: true
2019-12-19 03:20:03,256 TRACE [Scheduler 1528161485 Job d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 1576696803253, to newTs: 1576696803255, operation result: true
2019-12-19 03:20:04,187 INFO  [FetcherRunner 2062551616-143] threadpool.DefaultFetcherRunner:85 : Job Fetcher: 2 should running, 2 actual running, 0 stopped, 0 ready, 1754 already succeed, 7 error, 10 discarded, 0 others
2019-12-19 03:20:09,124 TRACE [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 1576696799048, to newTs: 1576696809122, operation result: true
2019-12-19 03:20:09,125 TRACE [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 1576696809122, to newTs: 1576696809124, operation result: true
2019-12-19 03:20:13,263 TRACE [Scheduler 1528161485 Job d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 1576696803255, to newTs: 1576696813262, operation result: true
2019-12-19 03:20:24,133 INFO  [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1 seconds.
2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 1576696809124, to newTs: 1576696819131, operation result: false
2019-12-19 03:20:25,284 TRACE [Scheduler 1528161485 Job d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 1576696813262, to newTs: 1576696825245, operation result: true
2019-12-19 03:20:25,299 ERROR [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 : error execute MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, name=Build N-Dimension Cuboid : level 1, state=RUNNING}
org.apache.kylin.common.persistence.WriteConflictException: Overwriting conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, expect old TS 1576696809124, but it is 1576696819131
at org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
at org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423)
at org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
at org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423)
at org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411)
at org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405)
at org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388)
at org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368)
at org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252)
at org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426)
at org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566)
at org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163)
at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)




At 03:20:24,133, the thread "Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879" is trying to write content to the path "/execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08" but the exception was raised for some reasons.

ExponentialBackoffRetry will retry if the following exceptions happened: SocketTimeoutException, or ConnectException, or RetriesExhaustedException or any other exceptions configured in "kylin.resourcestore.connection-exceptions"

By default the retry policy is disabled, but we have enabled it by setting "kylin.resourcestore.reconnect-enabled". Because we have a heavy loading HBase cluster.

Because failed to call checkAndPutResourceImpl, the thread "Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879" sleeps 1 second to back-off.



2019-12-19 03:20:09,125 TRACE [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 1576696809122, to newTs: 1576696809124, operation result: true
2019-12-19 03:20:13,263 TRACE [Scheduler 1528161485 Job d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 1576696803255, to newTs: 1576696813262, operation result: true
2019-12-19 03:20:24,133 INFO  [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1 seconds.
2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 1576696809124, to newTs: 1576696819131, operation result: false
2019-12-19 03:20:25,284 TRACE [Scheduler 1528161485 Job d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 1576696813262, to newTs: 1576696825245, operation result: true
2019-12-19 03:20:25,299 ERROR [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 : error execute MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, name=Build N-Dimension Cuboid : level 1, state=RUNNING}
org.apache.kylin.common.persistence.WriteConflictException: Overwriting conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, expect old TS 1576696809124, but it is 1576696819131
at org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)





Here is the guess of what's happended at that time:




Thus, we think ExponentialBackoffRetry implemented an incorrect retry policy. Determining whether to retry only by the type of exception is not reliable. At least, we should pull the last modified time to verify the written is successful or not. Right?