Posted to user@uima.apache.org by priyank sharma <pr...@orkash.com> on 2017/11/09 10:19:15 UTC

DUCC's job goes into infinite loop

All!

I have a problem with our DUCC cluster: a job process gets stuck and keeps
processing the same batch again and again. Once the batch reaches its
maximum duration, it is given the reason or extraordinary status
"CanceledByUser" and is then restarted with the same IDs. This usually
happens after 15 to 20 days and goes away after restarting the DUCC
cluster. When I look through the data store that the CAS consumer ingests
data into, the data for this batch never gets ingested, so most probably
this data is not being processed.

How can I check whether this data is being processed or not?

Are resources the issue, and why is the batch processed after restarting
the cluster?

We have a three-node cluster with 32 GB, 40 GB and 28 GB of RAM.



-- 
Thanks and Regards
Priyank Sharma


Re: DUCC's job goes into infinite loop

Posted by Lou DeGenaro <lo...@gmail.com>.
Are you running with a shared file system on your cluster?  Is your user
log directory located there?  Look at the DUCC daemon log files located in
$DUCC_HOME/logs. They should provide some clues as to what is wrong.  Feel
free to post (non-confidential versions of) them here for a second opinion.

Lou.
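
The log check suggested above can be partly automated. The following is a minimal sketch, not a DUCC tool: it assumes the daemon logs are plain-text files somewhere under a directory such as $DUCC_HOME/logs, and the function name `scan_logs` and the search patterns are illustrative choices. `OutOfMemoryError` is the standard Java heap-exhaustion error; the other patterns are only a guess at what is worth flagging.

```python
import os
import re

# Illustrative patterns only; adjust to whatever your logs actually contain.
PATTERNS = re.compile(r"OutOfMemoryError|Exception|ERROR")


def scan_logs(log_dir):
    """Return (path, line_number, text) for every log line matching PATTERNS."""
    hits = []
    for root, _dirs, files in os.walk(log_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            # errors="replace" keeps the scan going if a file has odd bytes.
            with open(path, errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if PATTERNS.search(line):
                        hits.append((path, number, line.rstrip()))
    return hits
```

It could be called as `scan_logs(os.path.join(os.environ["DUCC_HOME"], "logs"))`, and again on the job's own log directory, to get a short list of lines to post here.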

On Fri, Nov 10, 2017 at 12:11 AM, priyank sharma <pr...@orkash.com>
wrote:

> There is nothing on the work item page or the performance page on the web
> server. There is only one log file, for the main node, and no log files for
> the other two nodes. The DUCC job processes are not able to pick up data
> from the data source, and no UIMA aggregator is working on those batches.
>
> Is the issue caused by Java heap space? We are giving 4 GB of RAM to the
> job process.
>
> Attaching the Log file.
>
> Thanks and Regards
> Priyank Sharma
>
> On Thursday 09 November 2017 04:33 PM, Lou DeGenaro wrote:
>
>> The first place to look is in your job's logs.  Visit the ducc-mon jobs
>> page ducchost:42133/jobs.jsp then click on the id of your job.  Examine
>> the logs by clicking on each log file name, looking for any revealing
>> information.
>>
>> Feel free to post non-confidential snippets here, or if you'd like to chat
>> in real time we can use hipchat.
>>
>> Lou.
>>
>

Re: DUCC's job goes into infinite loop

Posted by Lou DeGenaro <lo...@gmail.com>.
I'm sorry, but it is still not clear to me.

You need to give step-by-step instructions on precisely what you are doing
and the events that occur so I can re-create the problem, and/or you need
to supply DUCC daemon and user logs and say which Job or Service number is
not behaving properly.

The best advice is to upgrade to the latest version of DUCC and see if your
problem still exists.

Lou.





On Wed, Nov 15, 2017 at 5:49 AM, priyank sharma <pr...@orkash.com>
wrote:

> My complaint is that the job is not able to process any data. We give a
> job a maximum of 75 minutes to process, so every 75 minutes a new job
> starts with the same batch of IDs as the previous job, and this behaviour
> continues until we restart DUCC.
>
> The last time this happened, one of our three machines was down and
> disconnected from the cluster.
>
> The machine that was down had all the UIMA-AS services deployed on it.
> The UIMA-AS services are used to process the data.
>
> When that machine is down, the UIMA-AS services should be allocated to
> the other two machines. Maybe DUCC fails to allocate the UIMA-AS services
> to the other two machines, and because of that the job may not be able to
> process data.
>
> The problem is still with the job, but I am trying to explain everything
> that happened to my cluster when the problem arose, to make it easier for
> you to help me.
>
> Is my problem clear to you now?
>
> Thanks and Regards
> Priyank Sharma
>
> On Wednesday 15 November 2017 03:57 PM, Lou DeGenaro wrote:
>
>> Please note that we make a clear distinction between "services" and
>> "jobs".  Earlier e-mail from you suggested that your trouble was related
>> to jobs.
>>
>> Here is my understanding of your situation.  You use ducc_submit to submit
>> a job comprising several work items.  DUCC starts three Job Processes all
>> on the same machine and these are successfully processing work items.  At
>> some point before all work items are completed the connection to that
>> machine is lost.  And at this point the trouble for you begins...is this
>> correct?
>>
>> DUCC should detect that the lost contact machine is down, and if there is
>> space on other machine(s) it should allocate new Job Processes to continue
>> the work.  However, the disconnected machine may continue processing any
>> work items it was working on prior to losing connectivity, so it is
>> possible that the same work items may have overlapping processing.  Is
>> overlapping processing of the same work items your complaint?
>>
>> Lou.
>>
>> On Tue, Nov 14, 2017 at 11:00 PM, priyank sharma <
>> priyank.sharma@orkash.com>
>> wrote:
>>
>>> By "server down" I mean that one of the three machines was disconnected
>>> from the cluster, and all the services were deployed on the machine that
>>> was disconnected.
>>>
>>> Thanks and Regards
>>> Priyank Sharma
>>>
>>> On Tuesday 14 November 2017 04:08 PM, Lou DeGenaro wrote:
>>>
>>>> What do you mean by "server down", precisely?  Since we have no logs to
>>>> look at we can only go by your descriptions.  We're trying to help...
>>>>
>>>> Lou.
>>>>
>>>> On Mon, Nov 13, 2017 at 11:30 PM, priyank sharma <
>>>> priyank.sharma@orkash.com>
>>>> wrote:
>>>>
>>>>> When our job went into the infinite loop, the UIMA analysis engine did
>>>>> not start, and one of our three servers was down; that server has all
>>>>> the services used by the UIMA analysis engine.
>>>>>
>>>>> Does the server being down cause this issue?
>>>>>
>>>>> Is memory the problem?
>>>>>
>>>>> Thanks and Regards
>>>>> Priyank Sharma
>>>>>
>>>>> On Monday 13 November 2017 07:38 PM, Eddie Epstein wrote:
>>>>>
>>>>>> Several different issues here. There is no "job completion cap", rather
>>>>>> there is a limit on how long an individual work item will be allowed
>>>>>> to process before it is labeled a timeout. The default number of such
>>>>>> errors + exceptions before a Job is stopped is 15. Please increase this
>>>>>> cap if you expect a work item to go longer.
>>>>>>
>>>>>> If a job process runs out of heap space it should go OOM at which
>>>>>> point unpredictable things will happen.  Do you see OOM exceptions in
>>>>>> the JP logfiles?
>>>>>>
>>>>>> As for a bug, it is still hard to understand what is happening. Newer
>>>>>> versions of DUCC include a ducc_gather_logs command that collects DUCC
>>>>>> daemon logfiles and state and makes it more likely we can understand
>>>>>> what is happening. No user application logfiles are included in the
>>>>>> captured tar file.
>>>>>>
>>>>>> Regards,
>>>>>> Eddie
>>>>>>
>>>>>> On Mon, Nov 13, 2017 at 12:33 AM, priyank sharma <
>>>>>> priyank.sharma@orkash.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, I am using DUCC v2.0.1. I have a three-node cluster with 32 GB,
>>>>>>> 40 GB and 28 GB of RAM. The job runs fine for 15-20 days; after that it
>>>>>>> goes into the infinite loop with the same batch of IDs. We have a
>>>>>>> 75-minute cap for a job to complete; if it does not, it starts again.
>>>>>>> So every 75 minutes a new job starts, but with the same batch of IDs
>>>>>>> as the previous one, and not even a single document is ingested into
>>>>>>> the data store. It stays in this state until we restart the server.
>>>>>>>
>>>>>>> Is this because of DUCC v2.0.1? Does this version of DUCC have such a
>>>>>>> bug?
>>>>>>>
>>>>>>> Does this problem occur because of Java heap space?
>>>>>>>
>>>>>>> Please suggest something, as there is nothing in the logs regarding my
>>>>>>> problem.
>>>>>>>
>>>>>>> Thanks and Regards
>>>>>>> Priyank Sharma
>>>>>>>
>>>>>>> On Friday 10 November 2017 09:00 PM, Eddie Epstein wrote:
>>>>>>>
>>>>>>>> Hi Priyank,
>>>>>>>>
>>>>>>>> Looks like you are running DUCC v2.0.x. There are so many bugs fixed
>>>>>>>> in subsequent versions, the latest being v2.2.1. Newer versions have a
>>>>>>>> ducc_update command that will upgrade an existing install, but given
>>>>>>>> all the changes since v2.0.x I suggest a clean install.
>>>>>>>>
>>>>>>>> Eddie
>>>>>>>>
>
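
Eddie's description quoted above (a per-work-item time limit, plus a cap of 15 errors and exceptions before a Job is stopped) can be illustrated with a toy model. This is only a sketch under stated assumptions, not DUCC's implementation: the function `run_job` and its parameters are hypothetical, only timeouts are counted as errors, and the job is assumed to stop as soon as the count reaches the cap.

```python
def run_job(durations_min, per_item_limit_min, max_errors=15):
    """Toy model of a job with a per-work-item time limit and an error cap.

    durations_min: how long each work item would need, in minutes.
    Returns (completed, timed_out, stopped_early).
    """
    completed = 0
    timed_out = 0
    for duration in durations_min:
        if duration > per_item_limit_min:
            # The work item is labeled a timeout and counts against the cap.
            timed_out += 1
            if timed_out >= max_errors:
                # Assumption: the job stops once the cap is reached.
                return completed, timed_out, True
        else:
            completed += 1
    return completed, timed_out, False
```

In this model a batch in which every work item exceeds the limit completes nothing before the job is stopped, which matches the symptom of no documents reaching the data store.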

Re: DUCC's job goes into infinite loop

Posted by priyank sharma <pr...@orkash.com>.
My complaint is that the job is not able to process any data. We give a
job a maximum of 75 minutes to process, so every 75 minutes a new job
starts with the same batch of IDs as the previous job, and this behaviour
continues until we restart DUCC.

The last time this happened, one of our three machines was down and
disconnected from the cluster.

The machine that was down had all the UIMA-AS services deployed on it.
The UIMA-AS services are used to process the data.

When that machine is down, the UIMA-AS services should be allocated to
the other two machines. Maybe DUCC fails to allocate the UIMA-AS services
to the other two machines, and because of that the job may not be able to
process data.

The problem is still with the job, but I am trying to explain everything
that happened to my cluster when the problem arose, to make it easier for
you to help me.

Is my problem clear to you now?

Thanks and Regards
Priyank Sharma
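
The symptom described above (successive jobs picking up the same batch of IDs, with nothing reaching the data store) can be checked mechanically once the IDs are available. The following is a minimal sketch, assuming the work-item IDs of each job and the IDs present in the data store can be exported as plain lists; `find_repeated_batches` and `missing_from_store` are hypothetical helper names, not part of DUCC.

```python
def find_repeated_batches(jobs):
    """Find consecutive jobs that processed exactly the same set of IDs.

    jobs: list of (job_id, iterable_of_work_item_ids) in submission order.
    Returns a list of (previous_job_id, job_id) pairs.
    """
    repeats = []
    previous = None
    for job_id, ids in jobs:
        current = frozenset(ids)  # order of IDs within a batch is ignored
        if previous is not None and current == previous[1]:
            repeats.append((previous[0], job_id))
        previous = (job_id, current)
    return repeats


def missing_from_store(batch_ids, stored_ids):
    """Return the batch IDs that never reached the data store, sorted."""
    return sorted(set(batch_ids) - set(stored_ids))
```

A non-empty result from `find_repeated_batches` together with a full batch returned by `missing_from_store` would confirm that the same batch is being retried without ever being ingested.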



Re: DUCC's job goes into infinite loop

Posted by Lou DeGenaro <lo...@gmail.com>.
Please note that we make a clear distinction between "services" and
"jobs".  Earlier e-mail from you suggested that your trouble was related to
jobs.

Here is my understanding of your situation.  You use ducc_submit to submit a
job comprising several work items.  DUCC starts three Job Processes all on
the same machine and these are successfully processing work items.  At some
point before all work items are completed the connection to that machine is
lost.  And at this point the trouble for you begins...is this correct?

DUCC should detect that the lost contact machine is down, and if there is
space on other machine(s) it should allocate new Job Processes to continue
the work.  However, the disconnected machine may continue processing any
work items it was working on prior to losing connectivity, so it is
possible that the same work items may have overlapping processing.  Is
overlapping processing of the same work items your complaint?

Lou.


Re: DUCC's job goes into infinite loop

Posted by priyank sharma <pr...@orkash.com>.
By "server down" I mean that one of the three machines was disconnected
from the cluster, and all the services were deployed on the machine that
was disconnected.

Thanks and Regards
Priyank Sharma



Re: DUCC's job goes into infintie loop

Posted by Lou DeGenaro <lo...@gmail.com>.
What do you mean by "server down", precisely?  Since we have no logs to
look at we can only go by your descriptions.  We're trying to help...

Lou.

On Mon, Nov 13, 2017 at 11:30 PM, priyank sharma <pr...@orkash.com>
wrote:

> When our job went into the infinite loop, the UIMA analysis engine did not
> start, and one of our three servers was down. That server hosts all the
> services used by the UIMA analysis engine.
>
> Could the server being down cause this issue?
>
> Is memory the problem?
>
> Thanks and Regards
> Priyank Sharma

Re: DUCC's job goes into infintie loop

Posted by priyank sharma <pr...@orkash.com>.
When our job went into the infinite loop, the UIMA analysis engine did not
start, and one of our three servers was down. That server hosts all the
services used by the UIMA analysis engine.

Could the server being down cause this issue?

Is memory the problem?

Thanks and Regards
Priyank Sharma

On Monday 13 November 2017 07:38 PM, Eddie Epstein wrote:
> Several different issues here. There is no "job completion cap", rather
> there is a limit on how long an individual work item will be allowed to
> process before it is labeled a timeout. The default number of such errors +
> exceptions before a Job is stopped is 15. Please increase this cap if you
> expect a work item to go longer.
>
> If a job process runs out of heap space it should go OOM at which point
> unpredictable things will happen.  Do you see OOM exceptions in the JP
> logfiles?
>
> As for a bug, it is still hard to understand what is happening. Newer
> versions of DUCC include a ducc_gather_logs command that collects DUCC
> daemon logfiles and state and makes it more likely we can understand what
> is happening. No user application logfiles are included in the captured tar
> file.
>
> Regards,
> Eddie


Re: DUCC's job goes into infintie loop

Posted by Eddie Epstein <ea...@gmail.com>.
Several different issues here. There is no "job completion cap", rather
there is a limit on how long an individual work item will be allowed to
process before it is labeled a timeout. The default number of such errors +
exceptions before a Job is stopped is 15. Please increase this cap if you
expect a work item to go longer.
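
The distinction above can be sketched as a toy model in Python. This is
not DUCC code: the function name, its parameters, and the choice to stop
exactly when the error count reaches the limit are assumptions made for
illustration. Only the per-work-item timeout and the default limit of 15
come from the description above.

```python
def run_job(durations_min, timeout_min, max_errors=15):
    """Toy model of a job whose work items can time out.

    durations_min: how long each work item would take, in minutes.
    timeout_min:   per-work-item limit; a longer item counts as an error.
    max_errors:    the job is stopped once this many errors accumulate
                   (assumed behaviour, for illustration only).

    Returns (completed_items, errors, stopped_early).
    """
    completed = 0
    errors = 0
    for duration in durations_min:
        if duration > timeout_min:
            # The item is labeled a timeout and counted as an error.
            errors += 1
            if errors >= max_errors:
                return completed, errors, True
        else:
            completed += 1
    return completed, errors, False
```

For example, run_job([10] * 5 + [90] * 20, timeout_min=75) returns
(5, 15, True): five items finish, fifteen items exceed the 75 minute
limit, and the job is stopped before the remaining items are tried.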

If a job process runs out of heap space it should go OOM at which point
unpredictable things will happen.  Do you see OOM exceptions in the JP
logfiles?
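
One way to check for OOM errors is to scan the job's log directory for the
standard JVM marker java.lang.OutOfMemoryError. The Python sketch below
does that. The "*.log" file pattern and the function name are assumptions;
adjust the pattern to however the JP logfiles are actually named on your
cluster.

```python
from pathlib import Path

# Text the JVM prints when it runs out of heap (or other memory).
OOM_MARKER = "java.lang.OutOfMemoryError"


def find_oom_lines(log_dir, pattern="*.log"):
    """Return (path, line_number, line) for every OOM line under log_dir.

    The "*.log" pattern is an assumption about how the logfiles are named.
    """
    hits = []
    for path in sorted(Path(log_dir).rglob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log has odd bytes.
        with open(path, errors="replace") as handle:
            for line_no, line in enumerate(handle, start=1):
                if OOM_MARKER in line:
                    hits.append((str(path), line_no, line.rstrip()))
    return hits
```

Calling find_oom_lines on the directory that holds the job's logs returns
an empty list when no logfile mentions an OutOfMemoryError.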

As for a bug, it is still hard to understand what is happening. Newer
versions of DUCC include a ducc_gather_logs command that collects DUCC
daemon logfiles and state and makes it more likely we can understand what
is happening. No user application logfiles are included in the captured tar
file.

Regards,
Eddie

On Mon, Nov 13, 2017 at 12:33 AM, priyank sharma <pr...@orkash.com>
wrote:

> Yes, I am using DUCC v2.0.1. I have a three-node cluster with 32 GB,
> 40 GB, and 28 GB of RAM. The job runs fine for 15-20 days, after which it
> goes into an infinite loop on the same batch of IDs. We have a 75-minute
> cap for a job to complete; if it does not finish, it starts again. So
> every 75 minutes a new job starts with the same batch of IDs as the
> previous one, and not a single document is ingested into the data store.
> It stays in this state until we restart the server.
>
> Is this because of DUCC v2.0.1? Does this version of DUCC have that bug?
>
> Does this problem occur because of the Java heap space?
>
> Please suggest something, as there is nothing in the logs regarding my
> problem.
>
> Thanks and Regards
> Priyank Sharma

Re: DUCC's job goes into infintie loop

Posted by priyank sharma <pr...@orkash.com>.
Yes, I am using DUCC v2.0.1. I have a three-node cluster with 32 GB,
40 GB, and 28 GB of RAM. The job runs fine for 15-20 days, after which it
goes into an infinite loop on the same batch of IDs. We have a 75-minute
cap for a job to complete; if it does not finish, it starts again. So
every 75 minutes a new job starts with the same batch of IDs as the
previous one, and not a single document is ingested into the data store.
It stays in this state until we restart the server.

Is this because of DUCC v2.0.1? Does this version of DUCC have that bug?

Does this problem occur because of the Java heap space?

Please suggest something, as there is nothing in the logs regarding my
problem.

Thanks and Regards
Priyank Sharma

On Friday 10 November 2017 09:00 PM, Eddie Epstein wrote:
> Hi Priyank,
>
> Looks like you are running DUCC v2.0.x. There are so many bugs fixed in
> subsequent versions, the latest being v2.2.1. Newer versions have a
> ducc_update command that will upgrade an existing install, but given all
> the changes since v2.0.x I suggest a clean install.
>
> Eddie


Re: DUCC's job goes into infintie loop

Posted by Eddie Epstein <ea...@gmail.com>.
Hi Priyank,

Looks like you are running DUCC v2.0.x. There are so many bugs fixed in
subsequent versions, the latest being v2.2.1. Newer versions have a
ducc_update command that will upgrade an existing install, but given all
the changes since v2.0.x I suggest a clean install.

Eddie

On Fri, Nov 10, 2017 at 12:11 AM, priyank sharma <pr...@orkash.com>
wrote:

> There is nothing on the work item page or the performance page of the web
> server. There is only one log file, for the main node; there are no log
> files for the other two nodes. The DUCC job processes are not able to pick
> up data from the data source, and no UIMA aggregator is working on those
> batches.
>
> Is the issue caused by the Java heap space? We are giving 4 GB of RAM to
> the job process.
>
> Attaching the log file.
>
> Thanks and Regards
> Priyank Sharma

Re: DUCC's job goes into infintie loop

Posted by priyank sharma <pr...@orkash.com>.
There is nothing on the work item page or the performance page of the web
server. There is only one log file, for the main node; there are no log
files for the other two nodes. The DUCC job processes are not able to pick
up data from the data source, and no UIMA aggregator is working on those
batches.

Is the issue caused by the Java heap space? We are giving 4 GB of RAM to
the job process.

Attaching the log file.

Thanks and Regards
Priyank Sharma

On Thursday 09 November 2017 04:33 PM, Lou DeGenaro wrote:
> The first place to look is in your job's logs.  Visit the ducc-mon jobs
> page ducchost:42133/jobs.jsp then click on the id of your job.  Examine the
> logs by clicking on each log file name looking for any revealing
> information.
>
> Feel free to post non-confidential snippets here, or if you'd like to chat
> in real time we can use HipChat.
>
> Lou.


Re: DUCC's job goes into infintie loop

Posted by Lou DeGenaro <lo...@gmail.com>.
The first place to look is in your job's logs.  Visit the ducc-mon jobs
page ducchost:42133/jobs.jsp then click on the id of your job.  Examine the
logs by clicking on each log file name looking for any revealing
information.

Feel free to post non-confidential snippets here, or if you'd like to chat
in real time we can use HipChat.

Lou.

On Thu, Nov 9, 2017 at 5:19 AM, priyank sharma <pr...@orkash.com>
wrote:

> All!
>
> I have a problem with a DUCC cluster in which a job process gets stuck and
> keeps processing the same batch again and again. Because of the maximum
> duration, the batch gets the reason or extraordinary status
> "CanceledByUser" and is then restarted with the same IDs. This usually
> happens after 15 to 20 days and goes away after restarting the DUCC
> cluster. When I look at the data store that the CAS consumer uses to
> ingest data, the data for this batch is never ingested. So most probably
> this data is not being processed.
>
> How can I check whether this data is being processed?
>
> Are resources the issue, and why is the batch processed after restarting
> the cluster?
>
> We have a three-node cluster with 32 GB, 40 GB, and 28 GB of RAM.
>
>
>
> --
> Thanks and Regards
> Priyank Sharma
>
>