Posted to user@whirr.apache.org by Paul Baclace <pa...@gmail.com> on 2011/10/03 22:22:47 UTC

non-deterministic "Could not get lock /var/lib/dpkg/lock"

Two runs of Whirr on EC2 yesterday randomly failed to install Hadoop 
components.  The first time it occurred on the master node, but when it 
occurred on one slave and not another, I could diff the /tmp/logs/ 
captured by jclouds on the two slaves.  In a third run, everything 
worked fine.  Same scripts driving Whirr, same AMI, same number of 
nodes, same region, etc.  The snippets of /tmp/logs/stderr.log below 
show that apt-get update hit "Could not get lock /var/lib/dpkg/lock" on 
one slave but not the other.

This is a serious reliability issue.  What is non-deterministic here?

Paul

------------ slave 1 -------------------
+ register_cloudera_repo
+ which dpkg
+ cat
+ curl -s http://archive.cloudera.com/debian/archive.key
+ sudo apt-key add -
+ sudo apt-get update
E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
+ which dpkg
+ apt-get update
E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
+ apt-get -y install hadoop-0.20

-------------- slave 2 ---------------
+ register_cloudera_repo
+ which dpkg
+ cat
+ curl -s http://archive.cloudera.com/debian/archive.key
+ sudo apt-key add -
+ sudo apt-get update
+ which dpkg
+ apt-get update
+ apt-get -y install hadoop-0.20
dpkg-preconfigure: unable to re-open stdin:
+ cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.dist
+ update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.dist 90
+ install_cdh_hbase -c aws-ec2 -u http://apache.cs.utah.edu/hbase/hbase-0.90.3/hbase-0.90.3.tar.gz

-------------

Re: non-deterministic "Could not get lock /var/lib/dpkg/lock"

Posted by Andrei Savu <sa...@gmail.com>.
I have created the following issue for this:
https://issues.apache.org/jira/browse/WHIRR-501

On Fri, Feb 3, 2012 at 8:24 PM, Andrei Savu <sa...@gmail.com> wrote:

> [...]

Re: non-deterministic "Could not get lock /var/lib/dpkg/lock"

Posted by Andrei Savu <sa...@gmail.com>.
Good catch, Karel! I have tried to investigate this in the past, but I
never considered that it might be a race condition with a cron job (most
of the synchronisation tests we've added were designed to prove that the
condition is not triggered by Whirr itself).

What if we stop the crond service while running the install/configure
scripts?
http://www.cyberciti.biz/faq/howto-linux-unix-start-restart-cron/
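
A minimal sketch of that idea (untested; the service is named "cron" on
Debian/Ubuntu and "crond" on RHEL-style systems):

# Sketch: stop cron for the duration of the install/configure phase and
# make sure it is restarted even if the script fails part-way.
sudo service cron stop 2>/dev/null || sudo service crond stop
trap 'sudo service cron start 2>/dev/null || sudo service crond start' EXIT

# ... run the install/configure scripts here ...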


> In my opinion, as many of the installation/configuration steps as
> possible should be done using a config management tool (puppet/chef).
>

Totally agree + we have the needed infrastructure for this.


> Once the configuration is published to each node you can trigger
> puppet/chef as often as you like, and eventually you should reach a
> good state. Running the complete whirr-generated script(s) multiple
> times is going to be slower and much more error-prone.
>

+ it's hard to make retry-friendly bash scripts.
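
For illustration, a retry-friendly apt-get wrapper would have to look
something like this (a sketch only, not something Whirr generates):

# Sketch: retry apt-get while a cron job holds the dpkg lock
# (up to 30 attempts, 10 seconds apart).
apt_get() {
  local i
  for i in $(seq 1 30); do
    sudo apt-get "$@" && return 0
    echo "apt-get $* failed (dpkg lock held?); retrying in 10s" >&2
    sleep 10
  done
  return 1
}

apt_get update
apt_get -y install hadoop-0.20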


>
> Regards,
> Karel
>
> On Mon, Oct 3, 2011 at 10:22 PM, Paul Baclace <pa...@gmail.com>
> wrote:
> > [...]

Re: non-deterministic "Could not get lock /var/lib/dpkg/lock"

Posted by Karel Vervaeke <ka...@outerthought.org>.
This bites me regularly as well.
I suspect this is caused by cron jobs - there are several cron jobs
invoking dpkg/apt/aptitude, all of which take the dpkg lock.
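
One way to confirm this is to check who holds the lock while a run is
failing, e.g.:

# Show the process holding the dpkg lock, if any (fuser is in psmisc).
sudo fuser -v /var/lib/dpkg/lock
# Or simply list any running apt/dpkg/aptitude processes:
ps aux | grep -E 'apt-get|aptitude|dpkg' | grep -v grep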

I'm seeing this on byon nodes, so a quick hack is to disable these
cron jobs (e.g. simply removing these files should do the trick):
"/etc/cron.daily/standard",
"/etc/cron.daily/dpkg",
"/etc/cron.daily/man-db",
"/etc/cron.daily/apt",
"/etc/cron.daily/aptitude",
"/etc/cron.weekly/man-db"

For EC2 you would have to create your own images without those cron jobs (blech).
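
In script form, the byon quick hack is just (sketch):

# Quick hack (byon nodes): remove the cron jobs that periodically take
# the dpkg lock.
sudo rm -f /etc/cron.daily/standard /etc/cron.daily/dpkg \
           /etc/cron.daily/man-db /etc/cron.daily/apt \
           /etc/cron.daily/aptitude /etc/cron.weekly/man-db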

This kind of problem (and others, such as failed downloads and similar
randomness) can never be completely avoided.
In my opinion, as many of the installation/configuration steps as
possible should be done using a config management tool (puppet/chef).
Once the configuration is published to each node you can trigger
puppet/chef as often as you like, and eventually you should reach a
good state. Running the complete whirr-generated script(s) multiple
times is going to be slower and much more error-prone.

Regards,
Karel

On Mon, Oct 3, 2011 at 10:22 PM, Paul Baclace <pa...@gmail.com> wrote:
> [...]



-- 
Karel Vervaeke
http://outerthought.org/
Open Source Content Applications
Makers of Kauri, Daisy CMS and Lily