Posted to dev@ambari.apache.org by Andrew Onischuk <ao...@hortonworks.com> on 2015/11/23 16:21:10 UTC

Review Request 40600: Service or component install fails when a non-ambari apt-get command is running

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40600/
-----------------------------------------------------------

Review request for Ambari and Dmitro Lisnichenko.


Bugs: AMBARI-14017
    https://issues.apache.org/jira/browse/AMBARI-14017


Repository: ambari


Description
-------

PROBLEM  
Customer Microsoft Research notes that they routinely run "apt-get check" via
a cron job on their servers to check for broken dependencies. They report that
this command may take up to two minutes to complete on various nodes in their
cluster. The command locks the package database via a write lock on
/var/lib/dpkg/lock. During that interval, if Ambari is asked to install a new
component or perform other maintenance tasks on a cluster node that require
access to the package database, the Ambari command fails. Since the apt-get
check runs from cron, apparently with some frequency, this is a problem for
ongoing maintenance, especially in large clusters.

It would be desirable if Ambari and/or the agent were more tolerant of locks
on the package database.

The stack trace at failure follows:

Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-INSTALL/scripts/hook.py", line 37, in <module>
    BeforeInstallHook().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-INSTALL/scripts/hook.py", line 33, in hook
    install_repos()
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-INSTALL/scripts/repo_initialization.py", line 59, in install_repos
    _alter_repo("create", params.repo_info, template)
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-INSTALL/scripts/repo_initialization.py", line 50, in _alter_repo
    components = ubuntu_components, # ubuntu specific
  File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 154, in __init__
    self.env.run()
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 152, in run
    self.run_action(resource, action)
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 118, in run_action
    provider_action()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/repository.py", line 110, in action_create
    retcode, out = checked_call(update_cmd_formatted, sudo=True, quiet=False)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 70, in inner
    result = function(command, **kwargs)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 92, in checked_call
    tries=tries, try_sleep=try_sleep)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 140, in _call_wrapper
    result = _call(command, **kwargs_copy)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 291, in _call
    raise Fail(err_msg)
resource_management.core.exceptions.Fail: Execution of 'apt-get update -qq -o Dir::Etc::sourcelist=sources.list.d/HDP.list -o Dir::Etc::sourceparts=- -o APT::Get::List-Cleanup=0' returned 100. W: GPG error: http://public-repo-1.hortonworks.com HDP InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY B9733A7A07513CAD
E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?

BUSINESS IMPACT  
MSFT Research will not manage their cluster with Ambari if this cannot be
fixed by the end of November.

EXPECTED  
Ambari retries installations for some period of time

ACTUAL  
Ambari fails
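
For illustration only, the EXPECTED behaviour could be sketched roughly as
below. This is not the patch under review. The function names
(run_with_lock_retry, is_lock_error) and the default values are hypothetical;
only the tries / try_sleep parameter names and the two error strings are taken
from the trace quoted above. The sketch retries a command only while its
output looks like a package-lock failure and gives up on any other error.

```python
import subprocess
import time

# Marker strings copied from the apt-get error output quoted in this request.
LOCK_ERROR_MARKERS = (
    "Could not get lock",
    "Unable to lock the administration directory",
)


def is_lock_error(output):
    """Return True if command output looks like a dpkg/apt lock failure."""
    return any(marker in output for marker in LOCK_ERROR_MARKERS)


def run_with_lock_retry(cmd, tries=5, try_sleep=1.0, runner=None,
                        sleep=time.sleep):
    """Run cmd, retrying only while it fails because the package lock is held.

    runner(cmd) must return (returncode, combined_output). By default the
    command is executed with subprocess. All names here are hypothetical.
    """
    if runner is None:
        def runner(command):
            proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                                    stderr=subprocess.STDOUT,
                                    universal_newlines=True)
            out, _ = proc.communicate()
            return proc.returncode, out

    for attempt in range(1, tries + 1):
        code, out = runner(cmd)
        if code == 0:
            return code, out
        # Give up on non-lock errors, or when no attempts remain.
        if not is_lock_error(out) or attempt == tries:
            raise RuntimeError("Execution of %r returned %d. %s"
                               % (cmd, code, out))
        sleep(try_sleep)
```

With a two-minute "apt-get check" window as described above, a caller would
need tries * try_sleep to exceed roughly 120 seconds for the retry to outlast
the lock; the defaults in the sketch are placeholders, not recommendations.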

SUPPORT ANALYSIS  
I created a simple program based on the code at
<http://beej.us/guide/bgipc/output/html/multipage/flocking.html> that takes a
write lock on /var/lib/dpkg/lock on demand, and then attempted a component
install on a new node in a cluster. The install failed. After the lock was
released, the installation succeeded. The failure is easily reproduced with a
simple C program on a target node.
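
The C program itself is not included in this request. As a rough Python
analogue (an assumption, not the program that was actually used), the sketch
below takes the same kind of fcntl write lock on a file. The function names
try_write_lock and hold_write_lock are hypothetical.

```python
import errno
import fcntl
import os
import time


def try_write_lock(path):
    """Try to take an exclusive, non-blocking fcntl lock on path.

    Returns the open file descriptor on success (the lock is held until the
    descriptor is closed), or None if another process holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o640)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError) as e:
        os.close(fd)
        if e.errno in (errno.EAGAIN, errno.EACCES):
            return None  # lock is held elsewhere
        raise
    return fd


def hold_write_lock(path, seconds):
    """Hold the write lock on path for the given number of seconds."""
    fd = try_write_lock(path)
    if fd is None:
        raise RuntimeError("%s is already locked" % path)
    try:
        time.sleep(seconds)
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

Running hold_write_lock("/var/lib/dpkg/lock", 120) as root on a test node
should hold the lock for about two minutes, during which an install attempted
from Ambari would be expected to hit the failure shown in the trace.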


Diffs
-----

  ambari-agent/src/test/python/resource_management/TestPackageResource.py 18b2d00 
  ambari-common/src/main/python/resource_management/core/providers/package/__init__.py 7e532bc 
  ambari-common/src/main/python/resource_management/core/providers/package/apt.py ddd6952 
  ambari-common/src/main/python/resource_management/core/providers/package/zypper.py 3ff3dfd 
  ambari-common/src/main/python/resource_management/core/resources/packaging.py 1ca88af 

Diff: https://reviews.apache.org/r/40600/diff/


Testing
-------

mvn clean test


Thanks,

Andrew Onischuk


Re: Review Request 40600: Service or component install fails when a non-ambari apt-get command is running

Posted by Dmitro Lisnichenko <dl...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40600/#review107598
-----------------------------------------------------------

Ship it!


Ship It!

- Dmitro Lisnichenko




Re: Review Request 40600: Service or component install fails when a non-ambari apt-get command is running

Posted by Andrew Onischuk <ao...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40600/
-----------------------------------------------------------

(Updated Nov. 23, 2015, 3:23 p.m.)


Review request for Ambari and Dmitro Lisnichenko.


Bugs: AMBARI-14017
    https://issues.apache.org/jira/browse/AMBARI-14017


Repository: ambari


Description (updated)
-------

PROBLEM  
User runs "apt-get check" via a cron job on their servers to check for broken
dependencies. They report that this command may take up to two minutes to
complete on various nodes in their cluster. The command locks the package
database via a write lock on /var/lib/dpkg/lock. During that interval, if
Ambari is asked to install a new component or perform other maintenance tasks
on a cluster node that require access to the package database, the Ambari
command fails. Since the apt-get check runs from cron, apparently with some
frequency, this is a problem for ongoing maintenance, especially in large
clusters.

It would be desirable if Ambari and/or the agent were more tolerant of locks
on the package database.

The stack trace at failure follows:

Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-INSTALL/scripts/hook.py", line 37, in <module>
    BeforeInstallHook().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-INSTALL/scripts/hook.py", line 33, in hook
    install_repos()
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-INSTALL/scripts/repo_initialization.py", line 59, in install_repos
    _alter_repo("create", params.repo_info, template)
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-INSTALL/scripts/repo_initialization.py", line 50, in _alter_repo
    components = ubuntu_components, # ubuntu specific
  File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 154, in __init__
    self.env.run()
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 152, in run
    self.run_action(resource, action)
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 118, in run_action
    provider_action()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/repository.py", line 110, in action_create
    retcode, out = checked_call(update_cmd_formatted, sudo=True, quiet=False)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 70, in inner
    result = function(command, **kwargs)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 92, in checked_call
    tries=tries, try_sleep=try_sleep)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 140, in _call_wrapper
    result = _call(command, **kwargs_copy)
  File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 291, in _call
    raise Fail(err_msg)
resource_management.core.exceptions.Fail: Execution of 'apt-get update -qq -o Dir::Etc::sourcelist=sources.list.d/HDP.list -o Dir::Etc::sourceparts=- -o APT::Get::List-Cleanup=0' returned 100. W: GPG error: http://public-repo-1.hortonworks.com HDP InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY B9733A7A07513CAD
E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?

IMPACT  
User will not manage their cluster with Ambari if this cannot be
fixed by the end of November.

EXPECTED  
Ambari retries installations for some period of time

ACTUAL  
Ambari fails

ANALYSIS  
I created a simple program based on the code at
<http://beej.us/guide/bgipc/output/html/multipage/flocking.html> that takes a
write lock on /var/lib/dpkg/lock on demand, and then attempted a component
install on a new node in a cluster. The install failed. After the lock was
released, the installation succeeded. The failure is easily reproduced with a
simple C program on a target node.


Diffs
-----

  ambari-agent/src/test/python/resource_management/TestPackageResource.py 18b2d00 
  ambari-common/src/main/python/resource_management/core/providers/package/__init__.py 7e532bc 
  ambari-common/src/main/python/resource_management/core/providers/package/apt.py ddd6952 
  ambari-common/src/main/python/resource_management/core/providers/package/zypper.py 3ff3dfd 
  ambari-common/src/main/python/resource_management/core/resources/packaging.py 1ca88af 

Diff: https://reviews.apache.org/r/40600/diff/


Testing
-------

mvn clean test


Thanks,

Andrew Onischuk