Maintainer scripts mishandle /var/cache/bind permissions

Bug #1086775 reported by Alex Bligh
This bug affects 3 people
Affects         Status        Importance  Assigned to  Milestone
bind9 (Debian)  Fix Released  Unknown
bind9 (Ubuntu)  Triaged       Medium      Unassigned

Bug Description

Affects: 1:9.7.0.dfsg.P1-1ubuntu0.8, 1:9.8.1.dfsg.P1-4ubuntu0.4, 1:9.8.4.dfsg-1ubuntu1.

bind9.postinst only sets permissions on
/var/cache/bind on a fresh install. When the bind9 package is removed
but not purged, /var/cache/bind is removed, but /etc/bind is left alone
(as expected). When the bind9 package is reinstalled from this state,
the postinst fails to correct the default 755 permissions on
/var/cache/bind.

This is particularly a problem for users upgrading from Lucid, since this
situation causes 100% CPU usage due to bug 1038199.

Steps to reproduce:

1. Start with a Lucid system
2. apt-get install bind9
3. apt-get remove bind9
4. apt-get install bind9

Note broken permissions in /var/cache/bind.

This isn't directly reproducible in Raring because files are now
left behind in /var/cache/bind causing /var/cache/bind to not be removed
when the package is removed (is this a separate bug?)

However, if from Lucid you then do:

5. do-release-upgrade

Then the problem propagates to Raring, and you'll see bug 1038199 (100% CPU usage).

Workaround:

# chown root:bind /var/cache/bind
# chmod 775 /var/cache/bind
# service bind9 restart
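
Before applying the workaround, it may help to confirm that the directory really is missing the group write bit. The following is a minimal sketch, not part of the bind9 package: the function names are hypothetical, and it assumes a find that supports -maxdepth (GNU and BSD find both do).

```shell
# Hypothetical helper: succeeds when the directory has the group write bit set.
# -perm -020 matches only if the group write bit is present.
group_writable() {
    [ -n "$(find "$1" -maxdepth 0 -type d -perm -020 2>/dev/null)" ]
}

# Prints a one-line verdict for the given directory.
check_cache_dir() {
    if group_writable "$1"; then
        echo "ok: $1 is group-writable"
    else
        echo "broken: $1 is not group-writable"
    fi
}
```

Running check_cache_dir /var/cache/bind on an affected machine should report the directory as broken until the chmod above has been applied.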

Logs from the upgraded machine (see 'the working directory is not writable' and 'permission denied'):

05-Dec-2012 12:23:35.719 found 2 CPUs, using 2 worker threads
05-Dec-2012 12:23:35.720 using up to 4096 sockets
05-Dec-2012 12:23:35.726 loading configuration from '/etc/bind/named.conf'
05-Dec-2012 12:23:35.727 reading built-in trusted keys from file '/etc/bind/bind.keys'
05-Dec-2012 12:23:35.727 using default UDP/IPv4 port range: [1024, 65535]
05-Dec-2012 12:23:35.728 using default UDP/IPv6 port range: [1024, 65535]
05-Dec-2012 12:23:35.729 listening on IPv6 interfaces, port 53
05-Dec-2012 12:23:35.731 listening on IPv4 interface lo, 127.0.0.1#53
05-Dec-2012 12:23:35.732 listening on IPv4 interface eth0, 10.40.0.5#53
05-Dec-2012 12:23:35.734 listening on IPv4 interface eth1, 10.157.128.1#53
05-Dec-2012 12:23:35.735 listening on IPv4 interface eth1, 10.161.208.1#53
05-Dec-2012 12:23:35.736 listening on IPv4 interface eth0.60, 10.157.16.12#53
05-Dec-2012 12:23:35.738 generating session key for dynamic DNS
05-Dec-2012 12:23:35.738 sizing zone task pool based on 7 zones
05-Dec-2012 12:23:35.744 using built-in root key for view _default
05-Dec-2012 12:23:35.744 set up managed keys zone for view _default, file 'managed-keys.bind'
05-Dec-2012 12:23:35.744 Warning: 'empty-zones-enable/disable-empty-zone' not set: disabling RFC 1918 empty zones
05-Dec-2012 12:23:35.744 automatic empty zone: 254.169.IN-ADDR.ARPA
05-Dec-2012 12:23:35.744 automatic empty zone: 2.0.192.IN-ADDR.ARPA
05-Dec-2012 12:23:35.744 automatic empty zone: 100.51.198.IN-ADDR.ARPA
05-Dec-2012 12:23:35.744 automatic empty zone: 113.0.203.IN-ADDR.ARPA
05-Dec-2012 12:23:35.744 automatic empty zone: 255.255.255.255.IN-ADDR.ARPA
05-Dec-2012 12:23:35.744 automatic empty zone: 0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.IP6.ARPA
05-Dec-2012 12:23:35.744 automatic empty zone: 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.IP6.ARPA
05-Dec-2012 12:23:35.744 automatic empty zone: D.F.IP6.ARPA
05-Dec-2012 12:23:35.744 automatic empty zone: 8.E.F.IP6.ARPA
05-Dec-2012 12:23:35.744 automatic empty zone: 9.E.F.IP6.ARPA
05-Dec-2012 12:23:35.744 automatic empty zone: A.E.F.IP6.ARPA
05-Dec-2012 12:23:35.744 automatic empty zone: B.E.F.IP6.ARPA
05-Dec-2012 12:23:35.744 automatic empty zone: 8.B.D.0.1.0.0.2.IP6.ARPA
05-Dec-2012 12:23:35.749 command channel listening on 127.0.0.1#953
05-Dec-2012 12:23:35.749 command channel listening on ::1#953
05-Dec-2012 12:23:35.749 the working directory is not writable
05-Dec-2012 12:23:35.749 ignoring config file logging statement due to -g option
05-Dec-2012 12:23:35.750 zone 0.in-addr.arpa/IN: loaded serial 1
05-Dec-2012 12:23:35.750 zone 157.10.in-addr.arpa/IN: loaded serial 1
05-Dec-2012 12:23:35.751 zone 127.in-addr.arpa/IN: loaded serial 1
05-Dec-2012 12:23:35.752 zone 255.in-addr.arpa/IN: loaded serial 1
05-Dec-2012 12:23:35.753 zone extility.install/IN: loaded serial 1300877104
05-Dec-2012 12:23:35.754 zone localhost/IN: loaded serial 2
05-Dec-2012 12:23:35.754 managed-keys-zone ./IN: loading from master file managed-keys.bind failed: file not found
05-Dec-2012 12:23:35.754 managed-keys.bind.jnl: create: permission denied
05-Dec-2012 12:23:35.754 managed-keys-zone ./IN: sync_keyzone:dns_journal_open -> unexpected error
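
The two relevant lines are easy to miss in a long startup log. As a rough sketch, a saved copy of the log could be filtered like this; the function name and the log file name in the usage note are hypothetical, and the patterns are simply the two messages quoted above.

```shell
# Hypothetical helper: reads named log output on stdin and prints only the
# lines that match the two symptoms quoted in this report.
permission_symptoms() {
    grep -i -E 'working directory is not writable|permission denied'
}
```

For example, permission_symptoms < named.log would print the 'working directory' and 'create: permission denied' lines from the log above and nothing else.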

description: updated
Robie Basak (racb) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better.

I've not been able to reproduce this when upgrading from Lucid to Precise (1:9.7.0.dfsg.P1-1ubuntu0.8 to 1:9.8.1.dfsg.P1-4ubuntu0.4). /var/cache/bind had the correct (775) permissions. If I remove its contents and change the permissions to 755, then I do see the 100% CPU usage and the error in the log file that you've reported. But what I don't see is how to get the permissions to the erroneous 755 in the first place - simply installing bind9 in Lucid and upgrading to Precise doesn't seem to do it.

Are you sure that the permissions weren't already wrong due to a local misconfiguration before you upgraded?

Marking as Incomplete for now. If you manage to figure out how to reproduce this problem, please comment and change the bug status back to New.

Changed in bind9 (Ubuntu):
status: New → Incomplete
Alex Bligh (ubuntu-alex-org) wrote :

The server concerned was automatically installed from a CD-ROM built from Ubuntu sources and (in respect of bind) it has only had automatic updates run on it. I am very confident it was not operator error.

It was upgraded with 'do-release-upgrade'.

I can tell you I am not the only person experiencing this. See for instance:
  http://ubuntuforums.org/showthread.php?t=1971471
  https://bugs.launchpad.net/ubuntu/+source/bind9/+bug/1038199 (same root cause I'm guessing)

I would have thought that given >1 people are seeing this, a chmod in the postinst file would do no harm.

Robie Basak (racb) wrote :

With as many users as we have, a common misconfiguration can lead to a number of reports. I'd have expected dozens of reports or more by now if this were a systemic upgrade problem. Also note that one of the reports you linked to has root group ownership too, which is inconsistent with a single root cause.

I've looked at the ways the bind9 maintainer scripts call chmod, and don't see how a group write permission could get lost:

$ grep chmod /var/lib/dpkg/info/*bind*
/var/lib/dpkg/info/bind9.postinst: chmod 775 /var/lib/bind
/var/lib/dpkg/info/bind9.postinst: chmod g+s /etc/bind
/var/lib/dpkg/info/bind9.postinst: chmod g+r /etc/bind/rndc.key /etc/bind/named.conf* || true
/var/lib/dpkg/info/bind9.postinst: chmod g+rwx /var/run/named /var/cache/bind

I've checked the maintainer scripts this way across Hardy, Lucid and Precise. It looks to me that the existing postinst intentionally avoids doing the chmod except in a particular circumstance which I presume is for upgrading from a specific previous version (presumably prior to Hardy). I'd like to understand the root cause before I'm comfortable pushing to change this, and there is a trivial workaround for those affected.

So steps to reproduce the problem would really help!

Alex Bligh (ubuntu-alex-org) wrote :

Well I'm pretty sure the problem is this. I've just gone to another (unconnected) Lucid box, and:

root@extility-developers:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 10.04.4 LTS
Release: 10.04
Codename: lucid
root@extility-developers:~# ls -ln /etc/bind/rndc.key
-rw-r----- 1 103 108 77 2012-06-14 14:23 /etc/bind/rndc.key

See rndc.key is owned by UID 103, which is not equal to 0. So the Precise postinst script does not do the chmod.

You may not have received reports because bind actually works, just uses high CPU.
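
The ownership test described above can be checked by hand on any machine. This is only a sketch of that check, assuming GNU coreutils stat; the function names are hypothetical and this is not the code from the postinst itself.

```shell
# Hypothetical helper: prints the numeric owner uid of a file (GNU stat).
owner_uid() {
    stat -c '%u' "$1"
}

# Succeeds only when the file is owned by uid 0, which is the case in which
# the postinst is described as running its chmod.
owned_by_root() {
    [ "$(owner_uid "$1")" = "0" ]
}
```

On the box shown above, owner_uid /etc/bind/rndc.key would print 103, so owned_by_root would fail and the chmod branch would be skipped.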

Robie Basak (racb) wrote : Re: [Bug 1086775] Re: bind9 uses high CPU after lucid->precise upgrade

On Wed, Dec 05, 2012 at 05:48:18PM -0000, Alex Bligh wrote:
> See rndc.key is owned by UID 103, which is not equal to 0. So the
> Precise postinst script does not do the chmod.

This is what I would expect, because the permissions on /var/cache/bind
should already be correct, and maintainer scripts generally try not to
interfere with anything the administrator might have done.

Alex Bligh (ubuntu-alex-org) wrote : Re: bind9 uses high CPU after lucid->precise upgrade

OK so my working hypothesis is this. On Lucid /var/cache/bind is created simply by virtue of it being a directory within the package (see the bind9.list file). The group write permission is added by the postinst. If the Lucid package was installed, then removed, then installed again, the following happens:

1. the first install would create /var/cache/bind with whatever ownership is in the package, and also /etc/bind/rndc.key with root ownership. The postinst then runs and fixes the group write permission on /var/cache/bind.

2. the removal would delete /var/cache/bind as it is not a conffile, but not /etc/bind/rndc.key

3. the second install would create /var/cache/bind again with (possibly) the wrong permissions, and the postinst script would not fix it.

This probably doesn't go wrong in Lucid because nothing writes to the cache directory and/or bind survives without the cache. It's certainly empty here on our Lucid boxes before the upgrade to Precise. But after the upgrade, Precise needs to write there, and then dies.

The above would happen (AFAICT) if *ANY* version ever released of the Lucid bind9.deb had broken permissions, as subsequent upgrades would not fix it.

The problem with only fixing permissions if some rather random file in /etc/ is owned by root is it is inherently fragile. Is there any reason why the bind cache directory should ever not be writeable by the group that owns it?
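
The hypothesis above can be illustrated with a toy model that never touches a real package or system path. Everything here is an assumption made for illustration: the function names are hypothetical, the whole sequence runs under a temporary directory, and the mere existence of the key file stands in for the real "owned by uid 0" test, since ownership cannot be changed without root. It assumes GNU coreutils stat.

```shell
# "Unpack": the package ships the cache directory with mode 755.
sim_unpack() {
    mkdir -p "$1/etc/bind" "$1/var/cache/bind"
    chmod 755 "$1/var/cache/bind"
}

# "postinst": fixes permissions only when the key does not exist yet.
# ASSUMPTION: key existence is a stand-in for the real ownership test.
sim_postinst() {
    if [ ! -e "$1/etc/bind/rndc.key" ]; then
        : > "$1/etc/bind/rndc.key"
        chmod g+rwx "$1/var/cache/bind"
    fi
}

# "remove" (not purge): drops the cache directory, keeps etc/bind.
sim_remove() {
    rm -rf "$1/var/cache/bind"
}

# Prints the octal mode of the simulated cache directory.
sim_mode() {
    stat -c '%a' "$1/var/cache/bind"
}

# Runs install, remove, install and prints the mode after each install.
sim_run() {
    root=$(mktemp -d) || return 1
    sim_unpack "$root"; sim_postinst "$root"
    echo "first install:  $(sim_mode "$root")"
    sim_remove "$root"
    sim_unpack "$root"; sim_postinst "$root"
    echo "second install: $(sim_mode "$root")"
    rm -rf "$root"
}
```

Calling sim_run should print 775 for the first install and 755 for the second, which is the state this report describes.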

Alex Bligh (ubuntu-alex-org) wrote :

To follow this up, the .deb at least on Lucid does NOT have the write permission set.

amb@nimrod-ubuntu:~/bind-test$ dpkg -c bind9_9.7.0.dfsg.P1-1ubuntu0.8_amd64.deb | fgrep cache
drwxr-xr-x root/root 0 2012-10-09 14:13 ./var/cache/
drwxr-xr-x root/root 0 2012-10-09 14:13 ./var/cache/bind/
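
The same check can be scripted so it is easy to repeat against other package versions. A small sketch, with a hypothetical function name, that reads a `dpkg -c` listing on stdin and prints the permission string for one exact path; it assumes the path is the last field on the line, which holds for directories like the ones shown above.

```shell
# Hypothetical helper: prints the permission string (first column) of the
# listing entry whose last field equals the requested path.
listing_mode() {
    awk -v path="$1" '$NF == path { print $1 }'
}
```

For instance, piping the dpkg -c output above into listing_mode ./var/cache/bind/ would print drwxr-xr-x, confirming that the package ships the directory without group write permission.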

I've tried this on a pristine Precise box and it doesn't go wrong because Precise does not remove /var/cache/bind as it is populated (unlike on at least some Lucid installs). However, if I manually remove the cache directory, it does go wrong:

root@adamant:~# dpkg --list bind9
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Description
+++-==============-==============-============================================
ii bind9 1:9.8.1.dfsg.P Internet Domain Name Server
root@adamant:~# ls -lnd /var/cache/bind /etc/bind/rndc.key
-rw-r----- 1 103 108 77 Dec 3 20:56 /etc/bind/rndc.key
drwxrwxr-x 2 0 108 4096 Dec 4 21:00 /var/cache/bind
root@adamant:~# aptitude remove bind9
The following packages will be REMOVED:
  bind9
0 packages upgraded, 0 newly installed, 1 to remove and 0 not upgraded.
Need to get 0 B of archives. After unpacking 963 kB will be freed.
(Reading database ... 47095 files and directories currently installed.)
Removing bind9 ...
 * Stopping domain name service... bind9
waiting for pid 859 to die
   ...done.
Processing triggers for ufw ...
Processing triggers for ureadahead ...
ureadahead will be reprofiled on next reboot
Processing triggers for man-db ...

root@adamant:~# ls -lnd /var/cache/bind /etc/bind/rndc.key
-rw-r----- 1 103 108 77 Dec 3 20:56 /etc/bind/rndc.key
drwxrwxr-x 2 0 108 4096 Dec 5 19:13 /var/cache/bind
root@adamant:~# ls -la /var/cache/bind
total 16
drwxrwxr-x 2 root bind 4096 Dec 5 19:13 .
drwxr-xr-x 8 root root 4096 Dec 3 20:54 ..
-rw-r--r-- 1 bind bind 698 Dec 4 21:00 managed-keys.bind
-rw-r--r-- 1 bind bind 512 Dec 4 21:00 managed-keys.bind.jnl
root@adamant:~# rm -rf /var/cache/bind
root@adamant:~# aptitude install bind9
The following NEW packages will be installed:
  bind9
0 packages upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 343 kB of archives. After unpacking 963 kB will be used.
Get: 1 http://gb.archive.ubuntu.com/ubuntu/ precise-updates/main bind9 amd64 1:9.8.1.dfsg.P1-4ubuntu0.4 [343 kB]
Fetched 343 kB in 0s (621 kB/s)
Preconfiguring packages ...
Selecting previously unselected package bind9.
(Reading database ... 47062 files and directories currently installed.)
Unpacking bind9 (from .../bind9_1%3a9.8.1.dfsg.P1-4ubuntu0.4_amd64.deb) ...
Processing triggers for man-db ...
Processing triggers for ureadahead ...
Processing triggers for ufw ...
Setting up bind9 (1:9.8.1.dfsg.P1-4ubuntu0.4) ...
 * Starting domain name service... bind9
   ...done.

root@adamant:~# ls -lnd /var/cache/bind /etc/bind/rndc.key
-rw-r----- 1 103 108 77 Dec 3 20:56 /etc/bind/rndc.key
drwxr-xr-x 2 0 0 4096 Oct 9 14:06 /v...


Robie Basak (racb) wrote :

Thanks for your insight Alex. I've managed to reproduce this now, with the following steps:

On Lucid:
 1. sudo apt-get install bind9
 2. sudo apt-get remove bind9
 # this removes /var/cache/bind but leaves /etc/bind/rndc.key
 3. sudo apt-get install bind9
 # Now the postinst doesn't fix /var/cache/bind, but on Lucid nobody will notice this problem
 4. sudo do-release-upgrade
 # bind now uses /var/cache/bind/managed-keys.bind and the problem occurs

After the upgrade to Precise, bind9 is in the situation you described (permissions on /var/cache/bind wrong), with 100% CPU consumption.

I couldn't reproduce the problem directly on Precise because on package removal /var/cache/bind/managed-keys.bind is left behind and so /var/cache/bind never gets removed (I think this is a separate bug in itself).

It seems likely to me that this issue will affect Debian also, so next I will test this and file a bug report in Debian as needed, so that we can coordinate a fix.

Changed in bind9 (Ubuntu):
status: Incomplete → Triaged
importance: Undecided → Medium
Alex Bligh (ubuntu-alex-org) wrote :

Robie,

No problem - I'm just glad I wasn't imagining it.

I agree the 100% CPU problem can't be reproduced on precise.

To be honest I don't quite understand why /var/cache/bind isn't in /var/run (given it's a cache) but I may be wrong about that.

Alex

Robie Basak (racb) wrote :

There are two problems here. First is that the bind9.postinst fails to set permissions correctly in some circumstances, and second that without the permissions set correctly, you get 100% CPU usage.

The 100% CPU usage problem is already in bug 1038199, so we can track that there. That leaves this bug to track the postinst /var/cache/bind permissions problem.
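
One possible shape for a postinst that does not depend on the state of an unrelated file is sketched below. This is not the fix that was released in Debian, whose details are not shown in this report; the function name and its arguments are hypothetical, and the sketch simply applies the group and group-write settings every time it runs.

```shell
# Hypothetical sketch: make sure the cache directory exists, belongs to the
# given group, and is group-writable. Safe to run repeatedly.
ensure_cache_perms() {
    dir=$1
    group=$2
    [ -d "$dir" ] || mkdir -p "$dir"
    chgrp "$group" "$dir"
    chmod g+rwx "$dir"
}
```

In a maintainer script this might be called as ensure_cache_perms /var/cache/bind bind. The trade-off is the one raised earlier in this report: an unconditional chmod would also override an administrator who had deliberately tightened the permissions.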

description: updated
summary: - bind9 uses high CPU after lucid->precise upgrade
+ Maintainer scripts mishandle /var/cache/bind permissions
Changed in bind9 (Debian):
status: Unknown → New
description: updated
Changed in bind9 (Debian):
status: New → Fix Released