Posted to dev@vcl.apache.org by Junaid Ali <al...@iit.edu> on 2014/10/08 00:53:12 UTC

vcld memory leak

Hello,
I've recently been hitting a memory leak with the vcl daemon (VCL version
2.3.2). The problem appears to be happening in the computer_not_being_used
subroutine within new.pm (see the attached log).

The problem appears to start when, during a reload, there was an issue
communicating with the VMware server. This left the VM powered off on the
VMHost and deleted the reservation's entries from the computerloadlog table:

|6309|19812:19812|reload| ---- CRITICAL ----
|6309|19812:19812|reload| 2014-09-03 09:45:50|6309|19812:19812|reload|vcld:die_handler(639)|:1: parser error : Start tag expected, '<' not found
|6309|19812:19812|reload| Can't connect to vcl2:443 (Connection refused)
|6309|19812:19812|reload| ^
|6309|19812:19812|reload| ( 0) vcld, die_handler (line: 639)
|6309|19812:19812|reload| (-1) LibXML.pm, (eval) (line: 378)
|6309|19812:19812|reload| (-2) LibXML.pm, parse_string (line: 378)
|6309|19812:19812|reload| (-3) VICommon.pm, (eval) (line: 2194)
|6309|19812:19812|reload| (-4) VICommon.pm, request (line: 2194)
|6309|19812:19812|reload| (-5) (eval 29660), RetrieveProperties (line: 172)
|6309|19812:19812|reload| (-6) VICommon.pm, update_view_data (line: 1663)
|6309|19812:19812|reload| (-7) VICommon.pm, get_view (line: 1512)
|6309|19812:19812|reload| (-8) vSphere_SDK.pm, _get_file_info (line: 2471)
|6309|19812:19812|reload| (-9) vSphere_SDK.pm, find_files (line: 2096)
|6309|19812:19812|reload| (-10) VMware.pm, remove_existing_vms (line: 1594)
|6309|19812:19812|reload| (-11) VMware.pm, load (line: 469)
|6309|19812:19812|reload| (-12) new.pm, reload_image (line: 671)
|6309|19812:19812|reload| (-13) new.pm, process (line: 291)
|6309|19812:19812|reload| (-14) vcld, make_new_child (line: 571)
2014-09-03 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6396)|removing computerloadlog entries matching loadstate = begin
2014-09-03 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6443)|deleted rows from computerloadlog for reservation id=19812

Now, when a new reservation comes in and the same VM is allocated for it,
the computer_not_being_used subroutine calls
$self->code_loop_timeout(sub{return !reservation_being_processed(@_)},
[$competing_reservation_id], $message, $total_wait_seconds,
$attempt_delay_seconds) (on line 815 in new.pm) and receives a 0 from
reservation_being_processed with the message:

*"2014-10-07
11:45:54|8175|23084:23084|new|utils.pm:reservation_being_processed(8634)|computerloadlog
'begin' entry does NOT exist for reservation 19812" *

The vcl daemon thinks that the reload has completed. This causes the same
reservation to be processed over and over within computer_not_being_used,
causing memory spikes and eventually killing that vcld process.
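
For reference, here is a rough sketch of how I understand code_loop_timeout
to behave (an assumption based on its name and arguments, not a copy of the
VCL source), which shows why the wait falls through on the first attempt
when the 'begin' row is missing:

# Sketch only: poll the supplied code ref until it returns true or the
# timeout expires; return 1 on success, 0 on timeout.
sub code_loop_timeout_sketch {
    my ($code_ref, $args, $message, $total_wait_seconds, $attempt_delay_seconds) = @_;
    my $end_time = time + $total_wait_seconds;
    while (time < $end_time) {
        # reservation_being_processed() returns 0 because the 'begin'
        # computerloadlog row was deleted, so !0 is true immediately and
        # the loop exits as if the competing reload had finished cleanly.
        return 1 if $code_ref->(@$args);
        sleep $attempt_delay_seconds;
    }
    return 0;
}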

Any ideas on how reservation_being_processed could handle the lack of
"begin" entries when it is used with code_loop_timeout from
computer_not_being_used, or how the DESTROY handler could make sure such
reservations are purged, so this issue doesn't occur?

Please let me know if you need any further clarification.

Thanks.

-- 
Junaid Ali

Re: vcld memory leak

Posted by Andy Kurth <an...@ncsu.edu>.
Try this:
wget https://svn.apache.org/repos/asf/vcl/sandbox/patches/vSphere_SDK_2.3.2_updated.pm

Regards,
Andy

Re: vcld memory leak

Posted by Junaid Ali <al...@iit.edu>.
Hi Andy,
Thanks for the reply and the quick patch.
I didn't receive any attachment. Can you please resend the attachment?

Thanks.

-- 
Junaid Ali

Re: vcld memory leak

Posted by Andy Kurth <an...@ncsu.edu>.
The PIDs look correct.  Is this host currently generating any of the 'Start
tag expected...' error messages?  If it is, then something else may be
wrong.  If not, and you see the error again in the future, I would check the
PIDs again.  It's pretty simple but tedious.  In your file, the value
in vmware-hostd.PID (2953236) should match the PID of the
first hostd-worker process and the PPID of the other hostd-worker
processes.  The same goes for the other files and processes.
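
If it helps, here is a quick Perl sketch (not part of VCL; it assumes
key-based SSH access as root from the management node and that 'ps -ef' on
the host prints the PID and PPID for each process) that pulls the .PID file
and the worker list and flags lines that don't reference the recorded PID:

#!/usr/bin/perl
# Sketch: compare vmware-hostd.PID on a VM host with the hostd-worker processes.
use strict;
use warnings;

my $host = shift or die "usage: $0 <vmhost>\n";

chomp(my $hostd_pid = `ssh root\@$host cat /var/run/vmware/vmware-hostd.PID`);
print "vmware-hostd.PID contains: $hostd_pid\n";

my @workers = grep { /hostd-worker/ && !/grep/ } `ssh root\@$host ps -ef`;
for my $line (@workers) {
    # Each line should contain $hostd_pid as either its PID (first worker)
    # or its PPID (the rest); anything else suggests a stale .PID file.
    my $flag = ($line =~ /\b\Q$hostd_pid\E\b/) ? 'ok   ' : 'CHECK';
    print "$flag $line";
}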

I have updated the code in the repository to catch the error which caused
the process to die over and over again.  I also applied similar changes to
the vSphere_SDK.pm file that shipped with 2.3.2; the result is the attached
file.  You can try to swap your current file:
/usr/local/vcl/lib/VCL/Module/Provisioning/vSphere_SDK.pm
...with the attached file and restart the vcld service.

I tested this on a simple reload and it worked.  The updated file won't
prevent the initial error from occurring but it will catch the problem so
that the vcld process doesn't abruptly die and repeatedly retry.

-Andy


Re: vcld memory leak

Posted by Junaid Ali <al...@iit.edu>.
Hi Andy,
Thanks for the information.
I was able to SSH into the VMHost and run the commands. It runs the vim-cmd
commands without any errors. Attached are the PIDs from the files as well as
the ps command output, and they are consistent, so it may not be related to
a PID mismatch. I looked back at the week of 9/3, when the original problem
occurred, and could not find anything out of place on the VMHost (please see
the attached PDF report from vCenter) or the management node.

Thanks.
Junaid.

-- 
Junaid Ali
Systems & Virtualization Engineer,
Office of Technology Services/IIT,
10W, 31st Street,
Stuart Building Room # 007,
Chicago, IL - 60616
Ph (O): 312-567-5836
Ph (F): 312-567-5968

Re: vcld memory leak

Posted by Andy Kurth <an...@ncsu.edu>.
There are probably 2 related problems -- (1) the health of the ESXi server
and (2) the VCL code not handling all cases when the health of the host
causes unexpected results.  More below...

On Tue, Oct 7, 2014 at 6:53 PM, Junaid Ali <al...@iit.edu> wrote:

> Hello,
> I've recently been hitting a memory leak with the vcl daemon (VCL version
> 2.3.2). The problem appears to be happening in the computer_not_being_used
> subroutine within new.pm (see the attached log).
>
> The problem appears to start when, during a reload, there was an issue
> communicating with the VMware server. This left the VM powered off on the
> VMHost and deleted the reservation's entries from the computerloadlog table:
>
> |6309|19812:19812|reload| ---- CRITICAL ----
> |6309|19812:19812|reload| 2014-09-03 09:45:50|6309|19812:19812|reload|vcld:die_handler(639)|:1: parser error : Start tag expected, '<' not found
> |6309|19812:19812|reload| Can't connect to vcl2:443 (Connection refused)
> |6309|19812:19812|reload| ^
> |6309|19812:19812|reload| ( 0) vcld, die_handler (line: 639)
> |6309|19812:19812|reload| (-1) LibXML.pm, (eval) (line: 378)
> |6309|19812:19812|reload| (-2) LibXML.pm, parse_string (line: 378)
> |6309|19812:19812|reload| (-3) VICommon.pm, (eval) (line: 2194)
> |6309|19812:19812|reload| (-4) VICommon.pm, request (line: 2194)
> |6309|19812:19812|reload| (-5) (eval 29660), RetrieveProperties (line: 172)
> |6309|19812:19812|reload| (-6) VICommon.pm, update_view_data (line: 1663)
> |6309|19812:19812|reload| (-7) VICommon.pm, get_view (line: 1512)
> |6309|19812:19812|reload| (-8) vSphere_SDK.pm, _get_file_info (line: 2471)
> |6309|19812:19812|reload| (-9) vSphere_SDK.pm, find_files (line: 2096)
> |6309|19812:19812|reload| (-10) VMware.pm, remove_existing_vms (line: 1594)
> |6309|19812:19812|reload| (-11) VMware.pm, load (line: 469)
> |6309|19812:19812|reload| (-12) new.pm, reload_image (line: 671)
> |6309|19812:19812|reload| (-13) new.pm, process (line: 291)
> |6309|19812:19812|reload| (-14) vcld, make_new_child (line: 571)
> 2014-09-03 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6396)|removing computerloadlog entries matching loadstate = begin
> 2014-09-03 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6443)|deleted rows from computerloadlog for reservation id=19812
>
>
Yes.  We are seeing this more and more as of late on our ESXi 4.1 servers.
This particular error only appears if you are using the vSphere SDK to
manage the host.  I believe the same underlying problem is described in the
following issue when SSH and vim-cmd are used to manage the host:
https://issues.apache.org/jira/browse/VCL-769

As a test, and to determine whether the problems are related, please SSH
into a server which is exhibiting the problem you described and run the
following command:
vim-cmd hostsvc/datastore/info

If this displays an error then they are related.  Running 'services.sh
restart' on the host may solve the problem.  If not, then it's likely the
.pid files in /var/run became inconsistent with the running services.  Each
should contain the PID of the corresponding service.  If they contain the
wrong PID then 'services.sh restart' will fail to restart some services and
the problems will continue.  If you verify that 'services.sh restart'
doesn't fix the issue, I can try to write instructions on how to fix the
files manually.  I have added some code to VIM_SSH.pm to try to correct the
.pid files automatically.  This isn't possible with the vSphere SDK.

Please send the contents of each of these files from an affected host:
/var/run/vmware/vmware-hostd.PID
/var/run/vmware/vicimprovider.PID
/var/run/vmware/vmkdevmgr.pid
/var/run/vmware/vmkeventd.pid
/var/run/vmware/vmsyslogd.pid
/var/run/vmware/vmware-rhttpproxy.PID
/var/run/vmware/vmware-vpxa.PID

And the output from these commands:
ps -ef | grep hostd-worker
ps -ef | grep sfcb-vmware_bas
ps -ef | grep vmkdevmgr
ps -ef | grep vmkeventd
ps -ef | grep vmsyslogd
ps -ef | grep rhttpproxy-work
ps -ef | grep vpxa-worker



> Now, when a new reservation comes in and the same VM is allocated for it,
> the computer_not_being_used subroutine calls
> $self->code_loop_timeout(sub{return !reservation_being_processed(@_)},
> [$competing_reservation_id], $message, $total_wait_seconds,
> $attempt_delay_seconds) (on line 815 in new.pm) and receives a 0 from
> reservation_being_processed with the message:
>
> *"2014-10-07
> 11:45:54|8175|23084:23084|new|utils.pm:reservation_being_processed(8634)|computerloadlog
> 'begin' entry does NOT exist for reservation 19812" *
>
> The vcl daemon thinks that the reload has completed. This causes the same
> reservation to be processed over and over within computer_not_being_used,
> causing memory spikes and eventually killing that vcld process.
>
> Any ideas on how reservation_being_processed could handle the lack of
> "begin" entries when it is used with code_loop_timeout from
> computer_not_being_used, or how the DESTROY handler could make sure such
> reservations are purged, so this issue doesn't occur?
>
>
The VCL code could be improved to better handle this problem.

The problem is probably due to when the error occurs in the vcld process
sequence -- very early on, when the object to manage the VM host is being
initialized.  From the trace output, the vSphere_SDK.pm::_get_file_info
subroutine is calling the vSphere SDK's Vim::get_view subroutine.  This
fails miserably (probably due to the SDK not catching the error described
above) and causes the entire vcld child process to die abruptly.  The
problem occurs before vcld changes the request state to 'pending'.  The
request state/laststate remains reload/reload or new/new when the
process dies.  As a result, vcld keeps trying the same sequence over and
over again.

It's possible to improve the code to catch this by wrapping all Vim::*
calls in an eval block.  I'll get this implemented for the next release.
Patching 2.3.2 may be possible but could also be ugly.  Vim::get_view is
called from many places in vSphere_SDK.pm.  Every one of these would need
to be updated.
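
The idea is something like this (a sketch of the approach only, with
illustrative variable names -- not the code in the updated file):

# Wrap the SDK call in an eval so an exception thrown inside Vim::get_view
# is caught instead of killing the vcld child process.
my $view;
eval {
    $view = Vim::get_view(mo_ref => $mo_ref);
};
if ($@) {
    notify($ERRORS{'WARNING'}, 0, "Vim::get_view failed: $@");
    return;
}
elsif (!$view) {
    notify($ERRORS{'WARNING'}, 0, "Vim::get_view returned an undefined value");
    return;
}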

Regards,
Andy


> Please let me know if you need any further clarification.
>
> Thanks.
>
> --
> Junaid Ali
>