Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2020/09/23 05:28:22 UTC

[GitHub] [helix] kaisun2000 opened a new issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue

kaisun2000 opened a new issue #1394:
URL: https://github.com/apache/helix/issues/1394


   LOG 1244
   
   >2020-09-23T03:25:49.9707808Z [ERROR] stopAndDeleteQueue(org.apache.helix.integration.task.TestTaskRebalancerStopResume)  Time elapsed: 600.045 s  <<< FAILURE!
   2020-09-23T03:25:49.9710243Z org.apache.helix.HelixException: Workflow "stopAndDeleteQueue", job "stopAndDeleteQueue_masterJob" timed out
   2020-09-23T03:25:49.9713274Z 	at org.apache.helix.integration.task.TestTaskRebalancerStopResume.stopAndDeleteQueue(TestTaskRebalancerStopResume.java:436)
   2020-09-23T03:25:49.9715618Z 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on issue #1394:
URL: https://github.com/apache/helix/issues/1394#issuecomment-849105254


   Closing the unstable-test tickets, since we now have an automatic tracking mechanism (https://github.com/apache/helix/pull/1757) for the most recent test issues.




[GitHub] [helix] kaisun2000 commented on issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on issue #1394:
URL: https://github.com/apache/helix/issues/1394#issuecomment-699573904


   LOG 1795
   
   >2020-09-27T02:06:47.8122232Z org.apache.helix.HelixException: Workflow "stopAndDeleteQueue" context is null or job "stopAndDeleteQueue_slaveJob" is not in states: [COMPLETED]; ctx is ZnRecord=WorkflowContext, {NAME=stopAndDeleteQueue, START_TIME=1601170046460, STATE=IN_PROGRESS}{JOB_STATES={stopAndDeleteQueue_masterJob=COMPLETED}, StartTime={stopAndDeleteQueue_masterJob=1601170046460}}{}, Stat=Stat {_version=0, _creationTime=0, _modifiedTime=0, _ephemeralOwner=0}, jobState is null, wf cfg ZnRecord=stopAndDeleteQueue, {AllowOverlapJobAssignment=false, Dag={"allNodes":["stopAndDeleteQueue_masterJob"],"childrenToParents":{},"parentsToChildren":{}}, Expiry=120000, FailureThreshold=0, IsJobQueue=true, JobPurgeInterval=1800000, MONITORING_DISABLED=true, ParallelJobs=1, TargetState=START, Terminable=false, WorkflowID=stopAndDeleteQueue, capacity=2147483647}{JobTypes={}}{}, Stat=Stat {_version=3, _creationTime=1601170046441, _modifiedTime=1601170046511, _ephemeralOwner=0}, jobcfg null, jbctx null
   2020-09-27T02:06:47.8128562Z 	at org.apache.helix.integration.task.TestTaskRebalancerStopResume.stopAndDeleteQueue(TestTaskRebalancerStopResume.java:458)
   2020-09-27T02:06:47.8130823Z 
   2020-09-27T02:06:48.2084063Z [ERROR] Failures: 
   2020-09-27T02:06:48.2087741Z [ERROR]   TestTaskRebalancerStopResume.stopAndDeleteQueue:458 » Helix Workflow "stopAndD...
   2020-09-27T02:06:48.2089226Z [ERROR] Tests run: 1207, Failures: 1, Errors: 0, Skipped: 1




[GitHub] [helix] kaisun2000 commented on issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on issue #1394:
URL: https://github.com/apache/helix/issues/1394#issuecomment-699334297


   LOG 1605
   
   >2020-09-26T03:31:14.6435435Z [ERROR] Tests run: 1207, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 5,269.987 s <<< FAILURE! - in TestSuite
   2020-09-26T03:31:14.6447448Z [ERROR] stopAndDeleteQueue(org.apache.helix.integration.task.TestTaskRebalancerStopResume)  Time elapsed: 600.853 s  <<< FAILURE!
   2020-09-26T03:31:14.6459709Z org.apache.helix.HelixException: Workflow "stopAndDeleteQueue" context is null or job "stopAndDeleteQueue_slaveJob" is not in states: [COMPLETED]
   2020-09-26T03:31:14.6546658Z 	at org.apache.helix.integration.task.TestTaskRebalancerStopResume.stopAndDeleteQueue(TestTaskRebalancerStopResume.java:456)
   2020-09-26T03:31:14.6552113Z 
   2020-09-26T03:31:15.0677866Z [ERROR] Failures: 
   2020-09-26T03:31:15.0694994Z [ERROR]   TestTaskRebalancerStopResume.stopAndDeleteQueue:456 » Helix Workflow "stopAndD...




[GitHub] [helix] kaisun2000 commented on issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on issue #1394:
URL: https://github.com/apache/helix/issues/1394#issuecomment-699574199


   LOG 1793
   
   >2020-09-27T02:06:59.2960843Z [ERROR] stopAndDeleteQueue(org.apache.helix.integration.task.TestTaskRebalancerStopResume)  Time elapsed: 608.535 s  <<< FAILURE!
   2020-09-27T02:06:59.3100773Z org.apache.helix.HelixException: Workflow "stopAndDeleteQueue" context is null or job "stopAndDeleteQueue_slaveJob" is not in states: [COMPLETED]; ctx is ZnRecord=WorkflowContext, {NAME=stopAndDeleteQueue, START_TIME=1601170052753, STATE=IN_PROGRESS}{JOB_STATES={stopAndDeleteQueue_masterJob=COMPLETED}, StartTime={stopAndDeleteQueue_masterJob=1601170052753}}{}, Stat=Stat {_version=0, _creationTime=0, _modifiedTime=0, _ephemeralOwner=0}, jobState is null, wf cfg ZnRecord=stopAndDeleteQueue, {AllowOverlapJobAssignment=false, Dag={"allNodes":["stopAndDeleteQueue_masterJob"],"childrenToParents":{},"parentsToChildren":{}}, Expiry=120000, FailureThreshold=0, IsJobQueue=true, JobPurgeInterval=1800000, MONITORING_DISABLED=true, ParallelJobs=1, TargetState=START, Terminable=false, WorkflowID=stopAndDeleteQueue, capacity=2147483647}{JobTypes={}}{}, Stat=Stat {_version=3, _creationTime=1601170052735, _modifiedTime=1601170052814, _ephemeralOwner=0}, jobcfg null, jbctx null
   2020-09-27T02:06:59.3115829Z 	at org.apache.helix.integration.task.TestTaskRebalancerStopResume.stopAndDeleteQueue(TestTaskRebalancerStopResume.java:458)
   2020-09-27T02:06:59.3119284Z 
   2020-09-27T02:06:59.7060789Z [ERROR] Failures: 
   2020-09-27T02:06:59.7064081Z [ERROR]   TestTaskRebalancerStopResume.stopAndDeleteQueue:458 » Helix Workflow "stopAndD...




[GitHub] [helix] kaisun2000 commented on issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on issue #1394:
URL: https://github.com/apache/helix/issues/1394#issuecomment-698047278


   Fix by adding the jobs in a batch.




[GitHub] [helix] kaisun2000 commented on issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on issue #1394:
URL: https://github.com/apache/helix/issues/1394#issuecomment-699699921


   Note the state of the resource config:
   
   JOB stopAndDeleteQueue_masterJob config is there
   JOB stopAndDeleteQueue_slaveJob config is not there
   WorkFlow stopAndDeleteQueue config is there with both master/slave jobs. 
   
   This is caused by a race condition between the selective update sequence and the job-adding sequence. It can be reproduced in a debugger too.
   I will write a doc with the details.
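
The following is a minimal sketch of one way such a race could produce the state listed above. It assumes the controller's cache refresh copies the job configs and the workflow DAG at two different moments, with the job being added in between; the thread does not confirm that exact ordering. The class and method names are hypothetical and are not Helix APIs. Plain in-memory collections stand in for ZooKeeper, and the interleaving is hard-coded rather than threaded so that the outcome is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of the suspected race; none of these names are Helix APIs.
class StaleCacheSketch {

  // Stand-in for adding a job to a queue: two separate writes, config first, DAG second.
  static void addJob(Map<String, String> jobConfigs, Set<String> dagNodes, String job) {
    jobConfigs.put(job, "config of " + job);
    dagNodes.add(job);
  }

  // Jobs named in the cached DAG for which the cached configs hold no entry.
  static List<String> jobsMissingConfig(Set<String> cachedDag, Map<String, String> cachedConfigs) {
    List<String> missing = new ArrayList<>();
    for (String job : cachedDag) {
      if (!cachedConfigs.containsKey(job)) {
        missing.add(job);
      }
    }
    return missing;
  }

  // Assumed bad interleaving: configs are cached, then the slave job is added, then the DAG is cached.
  static List<String> interleavedRefresh() {
    Map<String, String> jobConfigs = new HashMap<>();
    Set<String> dagNodes = new LinkedHashSet<>();
    addJob(jobConfigs, dagNodes, "stopAndDeleteQueue_masterJob");

    Map<String, String> cachedConfigs = new HashMap<>(jobConfigs); // refresh, part 1
    addJob(jobConfigs, dagNodes, "stopAndDeleteQueue_slaveJob");   // writer runs in between
    Set<String> cachedDag = new LinkedHashSet<>(dagNodes);         // refresh, part 2

    return jobsMissingConfig(cachedDag, cachedConfigs);
  }

  // Same writes, but both parts of the refresh happen after the job has been fully added.
  static List<String> consistentRefresh() {
    Map<String, String> jobConfigs = new HashMap<>();
    Set<String> dagNodes = new LinkedHashSet<>();
    addJob(jobConfigs, dagNodes, "stopAndDeleteQueue_masterJob");
    addJob(jobConfigs, dagNodes, "stopAndDeleteQueue_slaveJob");

    Map<String, String> cachedConfigs = new HashMap<>(jobConfigs);
    Set<String> cachedDag = new LinkedHashSet<>(dagNodes);

    return jobsMissingConfig(cachedDag, cachedConfigs);
  }

  public static void main(String[] args) {
    System.out.println("interleaved refresh, jobs missing config: " + interleavedRefresh());
    System.out.println("consistent refresh, jobs missing config: " + consistentRefresh());
  }
}
```

In the interleaved case the sketch reports stopAndDeleteQueue_slaveJob as present in the DAG but without a config, which is the combination listed above. In the consistent case nothing is missing.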




[GitHub] [helix] kaisun2000 commented on issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on issue #1394:
URL: https://github.com/apache/helix/issues/1394#issuecomment-699699145


   The case is exactly the same as what I confirmed in the debugger.
   LOG 1843
   
   >2020-09-27T21:29:31.1669800Z END: WorkflowControllerDataProvider.refresh() for cluster CLUSTER_TestTaskRebalancerStopResume, pipleline TASK, Cache resrouce config Content:**{stopAndDeleteQueue_masterJob=ZnRecord=stopAndDeleteQueue_masterJob, {Command=Reindex, ConcurrentTasksPerInstance=1, DisableExternalView=false, Expiry=86400000, FailureThreshold=0, IgnoreDependentJobFailure=false, JobCommandConfig={"Delay":"2000"}, JobID=stopAndDeleteQueue_masterJob, MONITORING_DISABLED=true, MaxAttemptsPerTask=10, MaxForcedReassignmentsPerTask=10, RebalanceRunningTask=false, TargetPartitionStates=MASTER, TargetResource=TestDB, TimeoutPerPartition=3600000, WorkflowID=stopAndDeleteQueue}{}{},** Stat=Stat {_version=0, _creationTime=1601242170561, _modifiedTime=1601242170561, _ephemeralOwner=0}, deleteJobFromRecurrentQueueNotStarted_masterJob0=ZnRecord=deleteJobFromRecurrentQueueNotStarted_masterJob0, {Command=Reindex, ConcurrentTasksPerInstance=1, DisableExternalView=false, Expiry=86400000, FailureThreshold=0, IgnoreDependentJobFailure=false, JobCommandConfig={"Delay":"200"}, JobID=deleteJobFromRecurrentQueueNotStarted_masterJob0, MONITORING_DISABLED=true, MaxAttemptsPerTask=10, MaxForcedReassignmentsPerTask=10, RebalanceRunningTask=false, TargetPartitionStates=MASTER, TargetResource=TestDB, TimeoutPerPartition=3600000, WorkflowID=deleteJobFromRecurrentQueueNotStarted}{}{}, Stat=Stat {_version=0, _creationTime=1601242166392, _modifiedTime=1601242166392, _ephemeralOwner=0}, deleteJobFromRecurrentQueueNotStarted_20200927T212926_masterJob0=ZnRecord=deleteJobFromRecurrentQueueNotStarted_20200927T212926_masterJob0, {Command=Reindex, ConcurrentTasksPerInstance=1, DisableExternalView=false, Expiry=86400000, FailureThreshold=0, IgnoreDependentJobFailure=false, JobCommandConfig={"Delay":"200"}, JobID=deleteJobFromRecurrentQueueNotStarted_20200927T212926_masterJob0, MONITORING_DISABLED=true, MaxAttemptsPerTask=10, MaxForcedReassignmentsPerTask=10, RebalanceRunningTask=false, TargetPartitionStates=MASTER, TargetResource=TestDB, TimeoutPerPartition=3600000, WorkflowID=deleteJobFromRecurrentQueueNotStarted_20200927T212926}{}{}, Stat=Stat {_version=0, _creationTime=1601242166409, _modifiedTime=1601242166409, _ephemeralOwner=0}, **stopAndDeleteQueue=ZnRecord=stopAndDeleteQueue, {AllowOverlapJobAssignment=false, Dag={"allNodes":["stopAndDeleteQueue_masterJob","stopAndDeleteQueue_slaveJob"],"childrenToParents":{"stopAndDeleteQueue_slaveJob":["stopAndDeleteQueue_masterJob"]},"parentsToChildren":{"stopAndDeleteQueue_masterJob":["stopAndDeleteQueue_slaveJob"]}}, Expiry=120000, FailureThreshold=0, IsJobQueue=true, JobPurgeInterval=1800000, MONITORING_DISABLED=true, ParallelJobs=1, TargetState=START, Terminable=false, WorkflowID=stopAndDeleteQueue,** capacity=2147483647}{JobTypes={}}{}, Stat=Stat {_version=2, _creationTime=1601242170547, _modifiedTime=1601242170572, _ephemeralOwner=0}, deleteJobFromRecurrentQueueNotStarted=ZnRecord=deleteJobFromRecurrentQueueNotStarted, {AllowOverlapJobAssignment=false, Dag={"allNodes":["deleteJobFromRecurrentQueueNotStarted_masterJob0","deleteJobFromRecurrentQueueNotStarted_slaveJob1"],"childrenToParents":{"deleteJobFromRecurrentQueueNotStarted_slaveJob1":["deleteJobFromRecurrentQueueNotStarted_masterJob0"]},"parentsToChildren":{"deleteJobFromRecurrentQueueNotStarted_masterJob0":["deleteJobFromRecurrentQueueNotStarted_slaveJob1"]}}, Expiry=120000, FailureThreshold=0, IsJobQueue=true, JobPurgeInterval=1800000, MONITORING_DISABLED=true, ParallelJobs=1, RecurrenceInterval=60, RecurrenceUnit=SECONDS, StartTime=09-27-2020 21:29:26, TargetState=START, Terminable=false, WorkflowID=deleteJobFromRecurrentQueueNotStarted, capacity=2147483647}{JobTypes={}}{}, Stat=Stat {_version=2, _creationTime=1601242166394, _modifiedTime=1601242170517, _ephemeralOwner=0}, deleteJobFromRecurrentQueueNotStarted_slaveJob1=ZnRecord=deleteJobFromRecurrentQueueNotStarted_slaveJob1, {Command=Reindex, ConcurrentTasksPerInstance=1, DisableExternalView=false, Expiry=86400000, FailureThreshold=0, IgnoreDependentJobFailure=false, JobCommandConfig={"Delay":"200"}, JobID=deleteJobFromRecurrentQueueNotStarted_slaveJob1, MONITORING_DISABLED=true, MaxAttemptsPerTask=10, MaxForcedReassignmentsPerTask=10, RebalanceRunningTask=false, TargetPartitionStates=SLAVE, TargetResource=TestDB, TimeoutPerPartition=3600000, WorkflowID=deleteJobFromRecurrentQueueNotStarted}{}{}, Stat=Stat {_version=0, _creationTime=1601242166392, _modifiedTime=1601242166392, _ephemeralOwner=0}, deleteJobFromRecurrentQueueNotStarted_20200927T212926=ZnRecord=deleteJobFromRecurrentQueueNotStarted_20200927T212926, {AllowOverlapJobAssignment=false, Dag={"allNodes":["deleteJobFromRecurrentQueueNotStarted_20200927T212926_masterJob0","deleteJobFromRecurrentQueueNotStarted_20200927T212926_slaveJob1"],"childrenToParents":{"deleteJobFromRecurrentQueueNotStarted_20200927T212926_slaveJob1":["deleteJobFromRecurrentQueueNotStarted_20200927T212926_masterJob0"]},"parentsToChildren":{"deleteJobFromRecurrentQueueNotStarted_20200927T212926_masterJob0":["deleteJobFromRecurrentQueueNotStarted_20200927T212926_slaveJob1"]}}, Expiry=120000, FailureThreshold=0, IsJobQueue=true, JobPurgeInterval=1800000, MONITORING_DISABLED=true, ParallelJobs=1, StartTime=09-27-2020 21:29:26, TargetState=START, Terminable=true, WorkflowID=deleteJobFromRecurrentQueueNotStarted_20200927T212926, capacity=2147483647}{JobTypes={}}{}, Stat=Stat {_version=0, _creationTime=1601242166410, _modifiedTime=1601242166410, _ephemeralOwner=0}, deleteJobFromRecurrentQueueNotStarted_20200927T212926_slaveJob1=ZnRecord=deleteJobFromRecurrentQueueNotStarted_20200927T212926_slaveJob1, {Command=Reindex, ConcurrentTasksPerInstance=1, DisableExternalView=false, Expiry=86400000, FailureThreshold=0, IgnoreDependentJobFailure=false, JobCommandConfig={"Delay":"200"}, JobID=deleteJobFromRecurrentQueueNotStarted_20200927T212926_slaveJob1, MONITORING_DISABLED=true, MaxAttemptsPerTask=10, MaxForcedReassignmentsPerTask=10, RebalanceRunningTask=false, TargetPartitionStates=SLAVE, TargetResource=TestDB, TimeoutPerPartition=3600000, WorkflowID=deleteJobFromRecurrentQueueNotStarted_20200927T212926}{}{}, Stat=Stat {_version=0, _creationTime=1601242166410, _modifiedTime=1601242166410, _ephemeralOwner=0}}
   2020-09-27T21:29:31.1710456Z Job stopAndDeleteQueue_slaveJob exists in jobdag bug job config missing, expire the job
   2020-09-27T21:29:31.1711452Z removed job config:/TaskRebalancer/stopAndDeleteQueue_slaveJob




[GitHub] [helix] kaisun2000 commented on issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue

Posted by GitBox <gi...@apache.org>.
kaisun2000 commented on issue #1394:
URL: https://github.com/apache/helix/issues/1394#issuecomment-699338155


   1. Enhance the log to show the job's state, or that ctx is null.
   ```
        if (ctx == null || !allowedStates.contains(ctx.getJobState(jobName))) {
          throw new HelixException(
   -          String.format("Workflow \"%s\" context is null or job \"%s\" is not in states: %s",
   -              workflowName, jobName, allowedStates));
   +          String.format("Workflow \"%s\" context is null or job \"%s\" is not in states: %s; ctx is %s, jobState is %s .",
   +              workflowName, jobName, allowedStates, ctx == null ? "null" : ctx, ctx != null ? ctx.getJobState(jobName) : "null"));
        }
   ```
   
   2. Don't let the job finish too quickly.
   ```
       Set<String> slave = Sets.newHashSet("SLAVE");
       JobConfig.Builder job2 = new JobConfig.Builder().setCommand(MockTask.TASK_COMMAND)
           .setJobCommandConfigMap(Collections.singletonMap(MockTask.JOB_DELAY, "2000"))
           .setTargetResource(WorkflowGenerator.DEFAULT_TGT_DB).setTargetPartitionStates(slave);
   ```
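
As a standalone illustration of the first change, the sketch below builds the same kind of diagnostic message without a Helix dependency. A plain Map from job name to state stands in for the workflow context, and the class and method names are hypothetical. It relies on the fact that String.format renders a null argument for %s as "null", so the job state only needs to be resolved once.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in: a Map of job name -> state plays the role of the workflow context.
class PollDiagnosticsSketch {

  static String failureMessage(String workflowName, String jobName,
      Set<String> allowedStates, Map<String, String> ctx) {
    // Resolve the job state once; a null ctx and an unknown job both yield null.
    String jobState = ctx == null ? null : ctx.get(jobName);
    // %s prints "null" for a null argument, so no further null checks are needed here.
    return String.format(
        "Workflow \"%s\" context is null or job \"%s\" is not in states: %s; ctx is %s, jobState is %s",
        workflowName, jobName, allowedStates, ctx, jobState);
  }

  public static void main(String[] args) {
    Set<String> allowed = Collections.singleton("COMPLETED");

    // Case 1: the context itself is missing.
    System.out.println(
        failureMessage("stopAndDeleteQueue", "stopAndDeleteQueue_slaveJob", allowed, null));

    // Case 2: the context exists but has no entry for the slave job.
    Map<String, String> ctx =
        Collections.singletonMap("stopAndDeleteQueue_masterJob", "COMPLETED");
    System.out.println(
        failureMessage("stopAndDeleteQueue", "stopAndDeleteQueue_slaveJob", allowed, ctx));
  }
}
```

The two cases print different ctx values, which is what makes a missing context distinguishable from a context that has no state for the job.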




[GitHub] [helix] jiajunwang closed issue #1394: fix TestTaskRebalancerStopResume.stopAndDeleteQueue

Posted by GitBox <gi...@apache.org>.
jiajunwang closed issue #1394:
URL: https://github.com/apache/helix/issues/1394


   

