Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2020/09/23 05:28:22 UTC
[GitHub] [helix] kaisun2000 opened a new issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue
kaisun2000 opened a new issue #1394:
URL: https://github.com/apache/helix/issues/1394
LOG 1244
>2020-09-23T03:25:49.9707808Z [ERROR] stopAndDeleteQueue(org.apache.helix.integration.task.TestTaskRebalancerStopResume) Time elapsed: 600.045 s <<< FAILURE!
2020-09-23T03:25:49.9710243Z org.apache.helix.HelixException: Workflow "stopAndDeleteQueue", job "stopAndDeleteQueue_masterJob" timed out
2020-09-23T03:25:49.9713274Z at org.apache.helix.integration.task.TestTaskRebalancerStopResume.stopAndDeleteQueue(TestTaskRebalancerStopResume.java:436)
2020-09-23T03:25:49.9715618Z
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org
[GitHub] [helix] jiajunwang commented on issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue
Posted by GitBox <gi...@apache.org>.
jiajunwang commented on issue #1394:
URL: https://github.com/apache/helix/issues/1394#issuecomment-849105254
Closing test-unstable tickets, since we now have an automatic tracking mechanism (https://github.com/apache/helix/pull/1757) for the most recent test issues.
[GitHub] [helix] kaisun2000 commented on issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue
Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on issue #1394:
URL: https://github.com/apache/helix/issues/1394#issuecomment-699573904
LOG 1795
>2020-09-27T02:06:47.8122232Z org.apache.helix.HelixException: Workflow "stopAndDeleteQueue" context is null or job "stopAndDeleteQueue_slaveJob" is not in states: [COMPLETED]; ctx is ZnRecord=WorkflowContext, {NAME=stopAndDeleteQueue, START_TIME=1601170046460, STATE=IN_PROGRESS}{JOB_STATES={stopAndDeleteQueue_masterJob=COMPLETED}, StartTime={stopAndDeleteQueue_masterJob=1601170046460}}{}, Stat=Stat {_version=0, _creationTime=0, _modifiedTime=0, _ephemeralOwner=0}, jobState is null, wf cfg ZnRecord=stopAndDeleteQueue, {AllowOverlapJobAssignment=false, Dag={"allNodes":["stopAndDeleteQueue_masterJob"],"childrenToParents":{},"parentsToChildren":{}}, Expiry=120000, FailureThreshold=0, IsJobQueue=true, JobPurgeInterval=1800000, MONITORING_DISABLED=true, ParallelJobs=1, TargetState=START, Terminable=false, WorkflowID=stopAndDeleteQueue, capacity=2147483647}{JobTypes={}}{}, Stat=Stat {_version=3, _creationTime=1601170046441, _modifiedTime=1601170046511, _ephemeralOwner=0}, jobcfg null, jbctx null
2020-09-27T02:06:47.8128562Z at org.apache.helix.integration.task.TestTaskRebalancerStopResume.stopAndDeleteQueue(TestTaskRebalancerStopResume.java:458)
2020-09-27T02:06:47.8130823Z
2020-09-27T02:06:48.2084063Z [ERROR] Failures:
2020-09-27T02:06:48.2087741Z [ERROR] TestTaskRebalancerStopResume.stopAndDeleteQueue:458 » Helix Workflow "stopAndD...
2020-09-27T02:06:48.2089226Z [ERROR] Tests run: 1207, Failures: 1, Errors: 0, Skipped: 1
[GitHub] [helix] kaisun2000 commented on issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue
Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on issue #1394:
URL: https://github.com/apache/helix/issues/1394#issuecomment-699334297
LOG 1605
>2020-09-26T03:31:14.6435435Z [ERROR] Tests run: 1207, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 5,269.987 s <<< FAILURE! - in TestSuite
2020-09-26T03:31:14.6447448Z [ERROR] stopAndDeleteQueue(org.apache.helix.integration.task.TestTaskRebalancerStopResume) Time elapsed: 600.853 s <<< FAILURE!
2020-09-26T03:31:14.6459709Z org.apache.helix.HelixException: Workflow "stopAndDeleteQueue" context is null or job "stopAndDeleteQueue_slaveJob" is not in states: [COMPLETED]
2020-09-26T03:31:14.6546658Z at org.apache.helix.integration.task.TestTaskRebalancerStopResume.stopAndDeleteQueue(TestTaskRebalancerStopResume.java:456)
2020-09-26T03:31:14.6552113Z
2020-09-26T03:31:15.0677866Z [ERROR] Failures:
2020-09-26T03:31:15.0694994Z [ERROR] TestTaskRebalancerStopResume.stopAndDeleteQueue:456 » Helix Workflow "stopAndD...
[GitHub] [helix] kaisun2000 commented on issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue
Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on issue #1394:
URL: https://github.com/apache/helix/issues/1394#issuecomment-699574199
LOG 1793
>2020-09-27T02:06:59.2960843Z [ERROR] stopAndDeleteQueue(org.apache.helix.integration.task.TestTaskRebalancerStopResume) Time elapsed: 608.535 s <<< FAILURE!
2020-09-27T02:06:59.3100773Z org.apache.helix.HelixException: Workflow "stopAndDeleteQueue" context is null or job "stopAndDeleteQueue_slaveJob" is not in states: [COMPLETED]; ctx is ZnRecord=WorkflowContext, {NAME=stopAndDeleteQueue, START_TIME=1601170052753, STATE=IN_PROGRESS}{JOB_STATES={stopAndDeleteQueue_masterJob=COMPLETED}, StartTime={stopAndDeleteQueue_masterJob=1601170052753}}{}, Stat=Stat {_version=0, _creationTime=0, _modifiedTime=0, _ephemeralOwner=0}, jobState is null, wf cfg ZnRecord=stopAndDeleteQueue, {AllowOverlapJobAssignment=false, Dag={"allNodes":["stopAndDeleteQueue_masterJob"],"childrenToParents":{},"parentsToChildren":{}}, Expiry=120000, FailureThreshold=0, IsJobQueue=true, JobPurgeInterval=1800000, MONITORING_DISABLED=true, ParallelJobs=1, TargetState=START, Terminable=false, WorkflowID=stopAndDeleteQueue, capacity=2147483647}{JobTypes={}}{}, Stat=Stat {_version=3, _creationTime=1601170052735, _modifiedTime=1601170052814, _ephemeralOwner=0}, jobcfg null, jbctx null
2020-09-27T02:06:59.3115829Z at org.apache.helix.integration.task.TestTaskRebalancerStopResume.stopAndDeleteQueue(TestTaskRebalancerStopResume.java:458)
2020-09-27T02:06:59.3119284Z
2020-09-27T02:06:59.7060789Z [ERROR] Failures:
2020-09-27T02:06:59.7064081Z [ERROR] TestTaskRebalancerStopResume.stopAndDeleteQueue:458 » Helix Workflow "stopAndD...
[GitHub] [helix] kaisun2000 commented on issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue
Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on issue #1394:
URL: https://github.com/apache/helix/issues/1394#issuecomment-698047278
Fix: add the jobs in a batch (batch add).
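As a rough illustration of why a batch add can help, here is a minimal toy sketch. It is not Helix code: `ToyQueue`, `enqueueJob`, `enqueueJobs`, and `publishedStates` are hypothetical names made up for this example. It only shows that adding jobs one at a time publishes an intermediate state that a concurrent reader could observe, while a batch add publishes a single state that already contains both jobs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical toy model for illustration only; none of this is Helix API.
class BatchAddSketch {

  /** In-memory queue that records every state it publishes to readers. */
  static final class ToyQueue {
    private List<String> jobs = Collections.emptyList();
    private final List<List<String>> published = new ArrayList<>();

    /** Adds a single job; each call publishes one new state. */
    synchronized void enqueueJob(String job) {
      enqueueJobs(Collections.singletonList(job));
    }

    /** Adds all jobs of the batch and publishes exactly one new state. */
    synchronized void enqueueJobs(List<String> batch) {
      List<String> next = new ArrayList<>(jobs);
      next.addAll(batch);
      jobs = Collections.unmodifiableList(next);
      published.add(jobs);
    }

    /** Returns a copy of every state that was ever visible to readers. */
    synchronized List<List<String>> publishedStates() {
      return new ArrayList<>(published);
    }
  }

  /** Two separate adds: a reader may see the queue with only the first job. */
  static List<List<String>> addOneByOne() {
    ToyQueue queue = new ToyQueue();
    queue.enqueueJob("masterJob");
    queue.enqueueJob("slaveJob");
    return queue.publishedStates();
  }

  /** One batch add: the only published state contains both jobs. */
  static List<List<String>> addAsBatch() {
    ToyQueue queue = new ToyQueue();
    queue.enqueueJobs(Arrays.asList("masterJob", "slaveJob"));
    return queue.publishedStates();
  }

  public static void main(String[] args) {
    System.out.println("one by one: " + addOneByOne());
    System.out.println("batch:      " + addAsBatch());
  }
}
```

The one-by-one path exposes two states and the batch path exposes one, which is the window a batch add is meant to close.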
[GitHub] [helix] kaisun2000 commented on issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue
Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on issue #1394:
URL: https://github.com/apache/helix/issues/1394#issuecomment-699699921
Note the state of the resource configs:
Job stopAndDeleteQueue_masterJob: config is there.
Job stopAndDeleteQueue_slaveJob: config is not there.
Workflow stopAndDeleteQueue: config is there, with both master and slave jobs in its DAG.
This is due to a race condition between the selective update sequence and the job adding sequence. It can be reproduced in a debugger too.
I will write a doc with the details.
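The inconsistent state described above (a job listed in the workflow DAG with no matching job config) can be expressed as a small check. This is a minimal, self-contained sketch and not code from Helix: the class `DagConfigCheck` and the method `jobsMissingConfig` are hypothetical, and the inputs are plain collections standing in for the cached workflow DAG nodes and the cached job config names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Helix.
class DagConfigCheck {

  /** Returns the jobs that appear in the DAG but have no job config, in DAG order. */
  static List<String> jobsMissingConfig(Collection<String> dagNodes, Set<String> jobConfigNames) {
    List<String> missing = new ArrayList<>();
    for (String job : dagNodes) {
      if (!jobConfigNames.contains(job)) {
        missing.add(job);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // Snapshot matching the state reported in this thread:
    // the DAG knows both jobs, but only the master job config is present.
    List<String> dagNodes =
        Arrays.asList("stopAndDeleteQueue_masterJob", "stopAndDeleteQueue_slaveJob");
    Set<String> jobConfigNames =
        new HashSet<>(Arrays.asList("stopAndDeleteQueue_masterJob"));

    System.out.println(jobsMissingConfig(dagNodes, jobConfigNames));
  }
}
```

A reader that takes its snapshot between the two writes would see `stopAndDeleteQueue_slaveJob` as missing, even though its config is about to be written.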
[GitHub] [helix] kaisun2000 commented on issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue
Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on issue #1394:
URL: https://github.com/apache/helix/issues/1394#issuecomment-699699145
The case is exactly the same, as I can confirm from the debugger.
LOG 1843
>2020-09-27T21:29:31.1669800Z END: WorkflowControllerDataProvider.refresh() for cluster CLUSTER_TestTaskRebalancerStopResume, pipleline TASK, Cache resrouce config Content:**{stopAndDeleteQueue_masterJob=ZnRecord=stopAndDeleteQueue_masterJob, {Command=Reindex, ConcurrentTasksPerInstance=1, DisableExternalView=false, Expiry=86400000, FailureThreshold=0, IgnoreDependentJobFailure=false, JobCommandConfig={"Delay":"2000"}, JobID=stopAndDeleteQueue_masterJob, MONITORING_DISABLED=true, MaxAttemptsPerTask=10, MaxForcedReassignmentsPerTask=10, RebalanceRunningTask=false, TargetPartitionStates=MASTER, TargetResource=TestDB, TimeoutPerPartition=3600000, WorkflowID=stopAndDeleteQueue}{}{},** Stat=Stat {_version=0, _creationTime=1601242170561, _modifiedTime=1601242170561, _ephemeralOwner=0}, deleteJobFromRecurrentQueueNotStarted_masterJob0=ZnRecord=deleteJobFromRecurrentQueueNotStarted_masterJob0, {Command=Reindex, ConcurrentTasksPerInstance=1, DisableExternalView=false, Expiry=86400000, FailureThreshold=0, IgnoreDependentJobFailure=false, JobCommandConfig={"Delay":"200"}, JobID=deleteJobFromRecurrentQueueNotStarted_masterJob0, MONITORING_DISABLED=true, MaxAttemptsPerTask=10, MaxForcedReassignmentsPerTask=10, RebalanceRunningTask=false, TargetPartitionStates=MASTER, TargetResource=TestDB, TimeoutPerPartition=3600000, WorkflowID=deleteJobFromRecurrentQueueNotStarted}{}{}, Stat=Stat {_version=0, _creationTime=1601242166392, _modifiedTime=1601242166392, _ephemeralOwner=0}, deleteJobFromRecurrentQueueNotStarted_20200927T212926_masterJob0=ZnRecord=deleteJobFromRecurrentQueueNotStarted_20200927T212926_masterJob0, {Command=Reindex, ConcurrentTasksPerInstance=1, DisableExternalView=false, Expiry=86400000, FailureThreshold=0, IgnoreDependentJobFailure=false, JobCommandConfig={"Delay":"200"}, JobID=deleteJobFromRecurrentQueueNotStarted_20200927T212926_masterJob0, MONITORING_DISABLED=true, MaxAttemptsPerTask=10, MaxForcedReassignmentsPerTask=10, RebalanceRunningTask=false, TargetPartitionStates=MASTER, TargetResource=TestDB, TimeoutPerPartition=3600000, WorkflowID=deleteJobFromRecurrentQueueNotStarted_20200927T212926}{}{}, Stat=Stat {_version=0, _creationTime=1601242166409, _modifiedTime=1601242166409, _ephemeralOwner=0}, **stopAndDeleteQueue=ZnRecord=stopAndDeleteQueue, {AllowOverlapJobAssignment=false, Dag={"allNodes":["stopAndDeleteQueue_masterJob","stopAndDeleteQueue_slaveJob"],"childrenToParents":{"stopAndDeleteQueue_slaveJob":["stopAndDeleteQueue_masterJob"]},"parentsToChildren":{"stopAndDeleteQueue_masterJob":["stopAndDeleteQueue_slaveJob"]}}, Expiry=120000, FailureThreshold=0, IsJobQueue=true, JobPurgeInterval=1800000, MONITORING_DISABLED=true, ParallelJobs=1, TargetState=START, Terminable=false, WorkflowID=stopAndDeleteQueue,** capacity=2147483647}{JobTypes={}}{}, Stat=Stat {_version=2, _creationTime=1601242170547, _modifiedTime=1601242170572, _ephemeralOwner=0}, deleteJobFromRecurrentQueueNotStarted=ZnRecord=deleteJobFromRecurrentQueueNotStarted, {AllowOverlapJobAssignment=false, Dag={"allNodes":["deleteJobFromRecurrentQueueNotStarted_masterJob0","deleteJobFromRecurrentQueueNotStarted_slaveJob1"],"childrenToParents":{"deleteJobFromRecurrentQueueNotStarted_slaveJob1":["deleteJobFromRecurrentQueueNotStarted_masterJob0"]},"parentsToChildren":{"deleteJobFromRecurrentQueueNotStarted_masterJob0":["deleteJobFromRecurrentQueueNotStarted_slaveJob1"]}}, Expiry=120000, FailureThreshold=0, IsJobQueue=true, JobPurgeInterval=1800000, MONITORING_DISABLED=true, ParallelJobs=1, RecurrenceInterval=60, RecurrenceUnit=SECONDS, StartTime=09-27-2020 21:29:26, TargetState=START, Terminable=false, WorkflowID=deleteJobFromRecurrentQueueNotStarted, capacity=2147483647}{JobTypes={}}{}, Stat=Stat {_version=2, _creationTime=1601242166394, _modifiedTime=1601242170517, _ephemeralOwner=0}, deleteJobFromRecurrentQueueNotStarted_slaveJob1=ZnRecord=deleteJobFromRecurrentQueueNotStarted_slaveJob1, {Command=Reindex, ConcurrentTasksPerInstance=1, DisableExternalView=false, Expiry=86400000, FailureThreshold=0, IgnoreDependentJobFailure=false, JobCommandConfig={"Delay":"200"}, JobID=deleteJobFromRecurrentQueueNotStarted_slaveJob1, MONITORING_DISABLED=true, MaxAttemptsPerTask=10, MaxForcedReassignmentsPerTask=10, RebalanceRunningTask=false, TargetPartitionStates=SLAVE, TargetResource=TestDB, TimeoutPerPartition=3600000, WorkflowID=deleteJobFromRecurrentQueueNotStarted}{}{}, Stat=Stat {_version=0, _creationTime=1601242166392, _modifiedTime=1601242166392, _ephemeralOwner=0}, deleteJobFromRecurrentQueueNotStarted_20200927T212926=ZnRecord=deleteJobFromRecurrentQueueNotStarted_20200927T212926, {AllowOverlapJobAssignment=false, Dag={"allNodes":["deleteJobFromRecurrentQueueNotStarted_20200927T212926_masterJob0","deleteJobFromRecurrentQueueNotStarted_20200927T212926_slaveJob1"],"childrenToParents":{"deleteJobFromRecurrentQueueNotStarted_20200927T212926_slaveJob1":["deleteJobFromRecurrentQueueNotStarted_20200927T212926_masterJob0"]},"parentsToChildren":{"deleteJobFromRecurrentQueueNotStarted_20200927T212926_masterJob0":["deleteJobFromRecurrentQueueNotStarted_20200927T212926_slaveJob1"]}}, Expiry=120000, FailureThreshold=0, IsJobQueue=true, JobPurgeInterval=1800000, MONITORING_DISABLED=true, ParallelJobs=1, StartTime=09-27-2020 21:29:26, TargetState=START, Terminable=true, WorkflowID=deleteJobFromRecurrentQueueNotStarted_20200927T212926, capacity=2147483647}{JobTypes={}}{}, Stat=Stat {_version=0, _creationTime=1601242166410, _modifiedTime=1601242166410, _ephemeralOwner=0}, deleteJobFromRecurrentQueueNotStarted_20200927T212926_slaveJob1=ZnRecord=deleteJobFromRecurrentQueueNotStarted_20200927T212926_slaveJob1, {Command=Reindex, ConcurrentTasksPerInstance=1, DisableExternalView=false, Expiry=86400000, FailureThreshold=0, IgnoreDependentJobFailure=false, JobCommandConfig={"Delay":"200"}, JobID=deleteJobFromRecurrentQueueNotStarted_20200927T212926_slaveJob1, MONITORING_DISABLED=true, MaxAttemptsPerTask=10, MaxForcedReassignmentsPerTask=10, RebalanceRunningTask=false, TargetPartitionStates=SLAVE, TargetResource=TestDB, TimeoutPerPartition=3600000, WorkflowID=deleteJobFromRecurrentQueueNotStarted_20200927T212926}{}{}, Stat=Stat {_version=0, _creationTime=1601242166410, _modifiedTime=1601242166410, _ephemeralOwner=0}}
2020-09-27T21:29:31.1710456Z Job stopAndDeleteQueue_slaveJob exists in jobdag bug job config missing, expire the job
2020-09-27T21:29:31.1711452Z removed job config:/TaskRebalancer/stopAndDeleteQueue_slaveJob
[GitHub] [helix] kaisun2000 commented on issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue
Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on issue #1394:
URL: https://github.com/apache/helix/issues/1394#issuecomment-699338155
1. Enhance the log to show what state the job is in, or whether ctx is null.
```
 if (ctx == null || !allowedStates.contains(ctx.getJobState(jobName))) {
   throw new HelixException(
-      String.format("Workflow \"%s\" context is null or job \"%s\" is not in states: %s",
-          workflowName, jobName, allowedStates));
+      String.format("Workflow \"%s\" context is null or job \"%s\" is not in states: %s; ctx is %s, jobState is %s .",
+          workflowName, jobName, allowedStates, ctx == null ? "null" : ctx, ctx != null ? ctx.getJobState(jobName) : "null"));
 }
```
2. Don't let the job finish too quickly.
```
Set<String> slave = Sets.newHashSet("SLAVE");
JobConfig.Builder job2 = new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND)
    .setJobCommandConfigMap(Collections.singletonMap(MockTask.JOB_DELAY, "2000"))
    .setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB).setTargetPartitionStates(slave);
```
[GitHub] [helix] jiajunwang closed issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue
Posted by GitBox <gi...@apache.org>.
jiajunwang closed issue #1394:
URL: https://github.com/apache/helix/issues/1394