Posted to user@uima.apache.org by "reshu.agarwal" <re...@orkash.com> on 2014/12/04 13:41:11 UTC

DUCC-unstable behaviour of ducc

Hi,

Please look at these stats:

Status  Name   Memory(GB)     Swap(GB)     Alien  Shares        Heartbeat
               usable  total  inuse  free  PIDs   total  inuse  (last)
        Total      58     70      0    29     9      29      3
up      S144       36     39      0    20     8      18      2      59
down    S143       22     31      0     9     1      11     11      58

I am not able to understand these stats.

Please help.

Reshu.

Re: DUCC-unstable behaviour of ducc

Posted by Lou DeGenaro <lo...@gmail.com>.

Are the machines where your DUCC daemons and/or agents run extremely busy?
Otherwise, I should think that the default heartbeat config should work as
is.

Lou.

On Wed, Dec 10, 2014 at 4:06 AM, reshu.agarwal <re...@orkash.com>
wrote:

> Dear Lou,
>
> My problem has been resolved. I just increased the maximum time allowed
> for receiving heartbeats from the agents.
>
> The "unstable behavior" of DUCC 1.1.0 in my case was that nodes kept going
> up and down, both when running a single instance of DUCC 1.1.0 and when
> running both DUCC versions simultaneously.
>
> Now I am able to run DUCC 1.1.0 alone, so only DUCC 1.1.0 is configured.
>
> Thanks for your help. :-)
>
> Reshu.

Re: DUCC-unstable behaviour of ducc

Posted by "reshu.agarwal" <re...@orkash.com>.

Dear Lou,

My problem has been resolved. I just increased the maximum time allowed
for receiving heartbeats from the agents.

The "unstable behavior" of DUCC 1.1.0 in my case was that nodes kept going
up and down, both when running a single instance of DUCC 1.1.0 and when
running both DUCC versions simultaneously.

Now I am able to run DUCC 1.1.0 alone, so only DUCC 1.1.0 is configured.

Thanks for your help. :-)

Reshu.
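
To illustrate why raising the maximum heartbeat wait can stop nodes from
flapping, here is a small Python sketch. The sample gaps are invented for the
example (chosen to lie near the 9 to 66 second range mentioned elsewhere in
this thread, with one slower outlier), and the two limits of 60 and 120
seconds are assumptions, not DUCC settings. The function name is hypothetical.

```python
def down_events(gaps_seconds, max_wait_seconds):
    """Count heartbeat gaps that exceed the allowed wait."""
    return sum(1 for gap in gaps_seconds if gap > max_wait_seconds)


if __name__ == "__main__":
    # Invented sample of seconds between consecutive heartbeats on a busy node.
    gaps = [58, 59, 61, 66, 60, 72, 59]
    for max_wait in (60, 120):
        print(max_wait, down_events(gaps, max_wait))
```

With a 60-second limit, three of the seven sample gaps would mark the node
down; with a 120-second limit, none would.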



On 12/08/2014 04:24 PM, Lou DeGenaro wrote:
> What is the "unstable behavior" of DUCC 1.1.0 when running it alone?
>
> All kinds of bad things can happen if you run 2 DUCCs on the same set of
> machines. I'm willing to help, but am not sure I can if you are running 2
> DUCCs - that's fairly complex.  Instead I urge you to run a single DUCC
> 1.1.0 and let's try to fix what's wrong with running it alone.
>
> Lou.

Re: DUCC-unstable behaviour of ducc

Posted by Lou DeGenaro <lo...@gmail.com>.

What is the "unstable behavior" of DUCC 1.1.0 when running it alone?

All kinds of bad things can happen if you run 2 DUCCs on the same set of
machines. I'm willing to help, but am not sure I can if you are running 2
DUCCs - that's fairly complex.  Instead I urge you to run a single DUCC
1.1.0 and let's try to fix what's wrong with running it alone.

Lou.

On Sun, Dec 7, 2014 at 11:40 PM, reshu.agarwal <re...@orkash.com>
wrote:

>
> Yes, I am running both at the same time. I first tried running only version
> 1.1.0 to check its performance, but because of the unstable behaviour I had
> to run DUCC 1.0.0 and DUCC 1.1.0 at the same time.  I am running DUCC 1.0.0
> for running jobs and DUCC 1.1.0 for testing purposes.
>
> Do I need to increase the heartbeat timing to more than 60 sec?
>
> Reshu.

Re: DUCC-unstable behaviour of ducc

Posted by "reshu.agarwal" <re...@orkash.com>.

Yes, I am running both at the same time. I first tried running only version
1.1.0 to check its performance, but because of the unstable behaviour I had
to run DUCC 1.0.0 and DUCC 1.1.0 at the same time.  I am running DUCC 1.0.0
for running jobs and DUCC 1.1.0 for testing purposes.

Do I need to increase the heartbeat timing to more than 60 sec?

Reshu.

On 12/05/2014 05:57 PM, Lou DeGenaro wrote:
> You can fetch the latest code containing the bug fix from SVN and build
> your own snapshot.  However, this bug is of minimal impact so there is no
> pressing need to do so.
>
> Are you trying to run 1.0 and 1.1 at the same time?  This can be very
> tricky.  You need to be sure of no overlaps.  I highly recommend that you
> pick one or the other.
>
> Lou.

Re: DUCC-unstable behaviour of ducc

Posted by Lou DeGenaro <lo...@gmail.com>.

You can fetch the latest code containing the bug fix from SVN and build
your own snapshot.  However, this bug is of minimal impact so there is no
pressing need to do so.

Are you trying to run 1.0 and 1.1 at the same time?  This can be very
tricky.  You need to be sure of no overlaps.  I highly recommend that you
pick one or the other.

Lou.

On Fri, Dec 5, 2014 at 6:31 AM, reshu.agarwal <re...@orkash.com>
wrote:

> Dear Lou,
>
> Thanks for confirming this.
>
> Is a version with the bug fix available for use?
>
> What could be the reason for the delay in heartbeats? The machines were
> not able to send heartbeats within 60 seconds, so the node goes down in
> DUCC 1.1.0, but DUCC 1.0.0 is working fine on the same machines.
>
> My master node is a physical machine and the client is a virtual machine.
> Can this be a reason for the delayed heartbeats as well as the increase in
> job processing time?
>
> Thanks.
>
> Reshu.

Re: DUCC-unstable behaviour of ducc

Posted by "reshu.agarwal" <re...@orkash.com>.

Dear Lou,

Thanks for confirming this.

Is a version with the bug fix available for use?

What could be the reason for the delay in heartbeats? The machines were
not able to send heartbeats within 60 seconds, so the node goes down in
DUCC 1.1.0, but DUCC 1.0.0 is working fine on the same machines.

My master node is a physical machine and the client is a virtual machine.
Can this be a reason for the delayed heartbeats as well as the increase in
job processing time?

Thanks.

Reshu.

On 12/05/2014 04:45 PM, Lou DeGenaro wrote:
> Each node has a DUCC Agent daemon that sends heartbeats.
>
> There was a bug discovered after the release of 1.1 whereby the share
> calculation is incorrect (a rounding up problem that you observe).  The
> impact of this bug should be minimal.  The bug has been fixed.
>
> Lou.

Re: DUCC-unstable behaviour of ducc

Posted by Lou DeGenaro <lo...@gmail.com>.

Each node has a DUCC Agent daemon that sends heartbeats.

There was a bug discovered after the release of 1.1 whereby the share
calculation is incorrect (a rounding up problem that you observe).  The
impact of this bug should be minimal.  The bug has been fixed.

Lou.
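
The arithmetic behind the share counts can be checked with a short Python
sketch. It assumes, as stated in the question, a share size of 2 GB, and it
assumes that the number of shares is the memory divided by the share size.
The function names are made up for the illustration; the thread does not show
the actual DUCC calculation or the details of the rounding bug.

```python
import math

SHARE_SIZE_GB = 2  # share size quoted in the thread


def shares_rounded_down(memory_gb, share_size_gb=SHARE_SIZE_GB):
    """Whole shares that fit into the given memory."""
    return memory_gb // share_size_gb


def shares_rounded_up(memory_gb, share_size_gb=SHARE_SIZE_GB):
    """Share count if a partial share is counted as a full one."""
    return math.ceil(memory_gb / share_size_gb)


if __name__ == "__main__":
    for memory_gb in (16, 30, 31):
        print(memory_gb, shares_rounded_down(memory_gb), shares_rounded_up(memory_gb))
```

With 2 GB shares, 16 GB usable gives 8 shares, which matches the reported
Shares:total, and 30 GB would give 15. Rounding only matters for odd sizes
such as 31 GB, where rounding down gives 15 and rounding up gives 16.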



On Fri, Dec 5, 2014 at 12:41 AM, reshu.agarwal <re...@orkash.com>
wrote:

> Lou,
>
> How does a node send heartbeats in DUCC? If you can tell me this, I will be
> able to identify why my nodes go down.
>
> The other problem I am facing is:
>
> Memory(GB):total  : 31
> Memory(GB):usable : 16
> Shares:total      :  8
> Shares:inuse      :  9
>
> The RAM actually available is 30 GB, so the shares available should be
> 15 (2 GB per share), but it is showing Memory(GB):usable : 16 and
> Shares:total : 8.
>
> In DUCC 1.0.0, I don't face this problem.
>
> Please explain the reason for this.
>
> Reshu.

Re: DUCC-unstable behaviour of ducc

Posted by "reshu.agarwal" <re...@orkash.com>.

Lou,

How does a node send heartbeats in DUCC? If you can tell me this, I will be
able to identify why my nodes go down.

The other problem I am facing is:

Memory(GB):total  : 31
Memory(GB):usable : 16
Shares:total      :  8
Shares:inuse      :  9

The RAM actually available is 30 GB, so the shares available should be
15 (2 GB per share), but it is showing Memory(GB):usable : 16 and
Shares:total : 8.

In DUCC 1.0.0, I don't face this problem.

Please explain the reason for this.

Reshu.


On 12/04/2014 06:42 PM, Lou DeGenaro wrote:
> Which of these are not understandable?  If you hover over the column heading
> a little more explanation is given (though not much).
>
> For example, if you hover over Heartbeat(last) you'll see "The elapsed time
> (in seconds) since the last heartbeat".  This should usually be around 60
> seconds.  On the system I'm looking at live presently, I see a range from 9
> to 66.  If the number gets too large, the DUCC system will consider the
> node down.  As best as I can tell, it looks like your numbers are 58 & 59
> which is perfect.
>
> Lou.

Re: DUCC-unstable behaviour of ducc

Posted by Lou DeGenaro <lo...@gmail.com>.

Which of these are not understandable?  If you hover over the column heading
a little more explanation is given (though not much).

For example, if you hover over Heartbeat(last) you'll see "The elapsed time
(in seconds) since the last heartbeat".  This should usually be around 60
seconds.  On the system I'm looking at live presently, I see a range from 9
to 66.  If the number gets too large, the DUCC system will consider the
node down.  As best as I can tell, it looks like your numbers are 58 & 59
which is perfect.

Lou.
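
The heartbeat check described above can be sketched in a few lines of Python.
This is only an illustration of the idea, not DUCC's implementation: the
`HeartbeatMonitor` class, its method names, and the 120-second threshold are
assumptions made for the example. The thread does not state the actual
threshold DUCC uses.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per node and report up/down status.

    Hypothetical illustration only; not DUCC code.
    """

    def __init__(self, max_silence_seconds=120.0, clock=time.monotonic):
        # max_silence_seconds is an assumed threshold, not a DUCC default.
        self.max_silence_seconds = max_silence_seconds
        self._clock = clock
        self._last_seen = {}

    def record_heartbeat(self, node):
        """Remember when the node's agent last reported in."""
        self._last_seen[node] = self._clock()

    def elapsed(self, node):
        """Seconds since the last heartbeat (the 'Heartbeat (last)' value)."""
        return self._clock() - self._last_seen[node]

    def status(self, node):
        """'up' while the silence is within the threshold, else 'down'."""
        return "up" if self.elapsed(node) <= self.max_silence_seconds else "down"


if __name__ == "__main__":
    now = [0.0]  # fake clock so the demo is deterministic
    monitor = HeartbeatMonitor(max_silence_seconds=120.0, clock=lambda: now[0])
    monitor.record_heartbeat("S144")
    now[0] = 59.0
    print(monitor.elapsed("S144"), monitor.status("S144"))
    now[0] = 130.0
    print(monitor.elapsed("S144"), monitor.status("S144"))
```

Passing the clock in as a parameter keeps the sketch testable without waiting
for real time to pass.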

On Thu, Dec 4, 2014 at 7:41 AM, reshu.agarwal <re...@orkash.com>
wrote:

> Hi,
>
> Please look at these stats:
>
> Status  Name   Memory(GB)     Swap(GB)     Alien  Shares        Heartbeat
>                usable  total  inuse  free  PIDs   total  inuse  (last)
>         Total      58     70      0    29     9      29      3
> up      S144       36     39      0    20     8      18      2      59
> down    S143       22     31      0     9     1      11     11      58
>
> I am not able to understand these stats.
>
> Please help.
>
> Reshu.
>
>