LXC + BTRFS: creating many containers simultaneously can cause system corruption

Bug #1214085 reported by Jay Taylor
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: High
Assigned to: Unassigned

Bug Description

Using lxc with a btrfs filesystem can lead to system corruption and containers which are unable to boot when many containers are created and started simultaneously.

Packages involved:

    - btrfs-tools
    - ubuntu-lxc/daily from http://ppa.launchpad.net/ubuntu-lxc/daily/ubuntu

`lsb_release -rd`:

    ubuntu@ip-10-91-30-131:~$ lsb_release -rd
    Description: Ubuntu 12.04.2 LTS
    Release: 12.04

Here is the relevant discussion thread: http://<email address hidden>/msg05472.html

And here is a key excerpt regarding reproducing the problem:

The general system state is something like:

    - N containers already running happily
    - Launch N+ more containers in rapid succession (in parallel, not serially)

Here is a test script which closely reflects what my application is actually doing. It slowly launches 10 containers and then uses "&" to rapidly fork an additional 10 clone/start operations. I have it doing 2 cycles of this, and running the script many times (2-6 times) eventually triggers the problem.

test.sh:

    #!/usr/bin/env bash

    # Usage: test.sh PREFIX (containers are named PREFIX1 through PREFIX40)
    prefix=$1

    test -z "${prefix}" && echo 'error: missing required parameter: prefix' 1>&2 && exit 1

    path=/mnt

    # Recreate the base container that every clone is snapshotted from.
    sudo lxc-destroy -n c1 2>/dev/null
    sudo lxc-create -t ubuntu -B btrfs -n c1

    # Clone and start 10 containers serially.
    for i in `seq 1 10`; do
        sudo lxc-clone -s -B btrfs -P $path -o c1 -n $prefix$i
        sudo lxc-start -d -n $prefix$i
    done
    # Fork 10 more clone/start operations in parallel.
    for i in `seq 11 20`; do
        echo $(sudo lxc-clone -s -B btrfs -P $path -o c1 -n $prefix$i; sudo lxc-start -d -n $prefix$i) &
    done

    sleep 10

    # Create even more.
    for i in `seq 21 30`; do
        sudo lxc-clone -s -B btrfs -P $path -o c1 -n $prefix$i
        sudo lxc-start -d -n $prefix$i
    done
    for i in `seq 31 40`; do
        echo $(sudo lxc-clone -s -B btrfs -P $path -o c1 -n $prefix$i; sudo lxc-start -d -n $prefix$i) &
    done

stop.sh:

    #!/usr/bin/env bash

    # Usage: stop.sh PREFIX (stops and destroys PREFIX1 through PREFIX40)
    prefix=$1

    test -z "${prefix}" && echo 'error: missing required parameter: prefix' 1>&2 && exit 1

    sudo lxc-destroy -n c1;

    # Kill and destroy every clone in parallel.
    for i in `seq 1 40`; do
        echo $(sudo lxc-stop -k -n $prefix$i; sudo lxc-destroy -n $prefix$i) &
    done

    bash ./test.sh x
    bash ./test.sh y
    bash ./test.sh z

If it doesn't manifest at first, try starting, stopping, and destroying varying numbers of containers for several cycles. Eventually I consistently reach a state where containers never get IP addresses and new containers cannot even be started:

    x1 RUNNING - - NO
    x10 RUNNING - - NO
    x11 RUNNING - - NO
    x12 RUNNING - - NO
    x13 RUNNING - - NO
    x14 RUNNING - - NO
    x15 RUNNING - - NO
    x16 RUNNING - - NO
    x17 RUNNING - - NO
    x18 RUNNING - - NO
    x19 RUNNING - - NO
    x2 RUNNING - - NO
    x20 RUNNING - - NO
    x21 RUNNING - - NO
    x22 RUNNING - - NO
    x23 RUNNING - - NO
    x24 RUNNING - - NO
    x25 RUNNING - - NO
    x26 RUNNING - - NO
    x27 RUNNING - - NO
    x28 RUNNING - - NO
    x29 RUNNING - - NO
    x3 RUNNING - - NO
    x30 RUNNING - - NO
    x31 RUNNING - - NO
    x32 RUNNING - - NO
    x33 RUNNING - - NO
    x34 RUNNING - - NO
    x35 RUNNING - - NO
    x36 RUNNING - - NO
    x37 RUNNING - - NO
    x38 RUNNING - - NO
    x39 RUNNING - - NO
    x4 RUNNING - - NO
    x40 RUNNING - - NO
    x5 RUNNING - - NO
    x6 RUNNING - - NO
    x7 RUNNING - - NO
    x8 RUNNING - - NO
    x9 RUNNING - - NO
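
To check for this failure state without reading the listing by eye, something like the following sketch may help. It assumes the listing has the column order NAME STATE IPV4 IPV6 AUTOSTART (as `lxc-ls --fancy` prints elsewhere in this report) and that "-" in the IPV4 column means no address was assigned. `count_without_ip` is a hypothetical helper name, not part of LXC.

```shell
#!/usr/bin/env bash
# Sketch: count RUNNING containers that have no IPv4 address.
# Assumption: each input line looks like "NAME STATE IPV4 IPV6 AUTOSTART".

count_without_ip() {
    # Field 2 is the state and field 3 the IPv4 column; "-" means no address.
    # Header and separator lines never match because field 2 is not "RUNNING".
    awk '$2 == "RUNNING" && $3 == "-" { n++ } END { print n + 0 }'
}

# Example usage (requires lxc, so it is left commented out):
#   sudo lxc-ls --fancy | count_without_ip
```

A non-zero count some time after the start operations have finished would match the symptom shown above.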

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: linux-image-3.2.0-40-virtual 3.2.0-40.64
ProcVersionSignature: Ubuntu 3.2.0-40.64-virtual 3.2.40
Uname: Linux 3.2.0-40-virtual x86_64
AcpiTables:

AlsaDevices:
 total 0
 crw-rw---T 1 root audio 116, 1 Aug 14 20:40 seq
 crw-rw---T 1 root audio 116, 33 Aug 14 20:40 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.0.1-0ubuntu17.3
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Date: Mon Aug 19 18:20:19 2013
Ec2AMI: ami-d0f89fb9
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-east-1a
Ec2InstanceType: m1.large
Ec2Kernel: aki-88aa75e1
Ec2Ramdisk: unavailable
Lspci:

Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
MarkForUpload: True
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: root=LABEL=cloudimg-rootfs ro console=hvc0
RelatedPackageVersions:
 linux-restricted-modules-3.2.0-40-virtual N/A
 linux-backports-modules-3.2.0-40-virtual N/A
 linux-firmware 1.79.4
RfKill: Error: [Errno 2] No such file or directory
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
WifiSyslog:

Jay Taylor (jaytaylor) wrote :
description: updated
Jay Taylor (jaytaylor)
description: updated
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1214085

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.11 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.11-rc6-saucy/

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key
Jay Taylor (jaytaylor) wrote :

Brad Figg: I'm happy to do this, but this is a server without a GUI, and the spawned w3m instance is unable to properly display the login form page.

Jay Taylor (jaytaylor) wrote :

Joseph Salisbury: I installed v3.11 RC2 and successfully reproduced the problem using the same technique. The result remains the same: after a series of simultaneous clone/start operations, LXC eventually can no longer start any additional containers.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Jay Taylor (jaytaylor) wrote :

Additional note: I've found the problem manifests more easily with the following modified test script (this one forks all LXC clone/start operations immediately):

test.sh:

    #!/usr/bin/env bash

    prefix=$1

    test -z "${prefix}" && echo 'error: missing required parameter: prefix' 1>&2 && exit 1

    path=/mnt/test

    sudo lxc-destroy -n c1 2>/dev/null
    sudo lxc-create -t ubuntu -B btrfs -n c1

    for i in `seq 1 20`; do
        echo $(sudo lxc-clone -s -B btrfs -P $path -o c1 -n $prefix$i; sudo lxc-start -d -n $prefix$i) &
    done

    sleep 10

    # Create even more.
    for i in `seq 21 40`; do
        echo $(sudo lxc-clone -s -B btrfs -P $path -o c1 -n $prefix$i; sudo lxc-start -d -n $prefix$i) &
    done
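
The script above forks every clone/start operation and never checks whether any of them failed. A minimal sketch of one way to fork the same jobs, wait for all of them, and count failures is shown below. `run_parallel` and `clone_and_start` are hypothetical helpers introduced only for illustration; they are not part of LXC, and this variant has not been run against the bug.

```shell
#!/usr/bin/env bash
# Sketch: fork one job per index, wait for all of them, and print how many failed.
# The function body is a subshell "( ... )" so its variables stay local.

run_parallel() (
    first=$1; last=$2; shift 2
    pids=""; failed=0
    for i in $(seq "$first" "$last"); do
        "$@" "$i" &            # run the given command with the index appended
        pids="$pids $!"        # remember the PID of the background job
    done
    for pid in $pids; do
        # wait returns the exit status of that specific job
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed"
)

# Hypothetical job mirroring the script above (requires lxc and sudo).
# Unlike the original, "&&" skips lxc-start when the clone fails.
# clone_and_start() {
#     sudo lxc-clone -s -B btrfs -P "$path" -o c1 -n "$prefix$1" &&
#         sudo lxc-start -d -n "$prefix$1"
# }
# run_parallel 1 20 clone_and_start
```

The printed number is the count of jobs that exited non-zero. A container can still start and then never obtain an address, so a zero here does not by itself rule out the problem.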

Joseph Salisbury (jsalisbury) wrote :

So it sounds like the bug still exists in the latest mainline kernel, and it has been around since at least the 3.2 kernel. Do you happen to know if there was a previous release that did not exhibit this bug?

Jay Taylor (jaytaylor) wrote :

I only started using lxc+btrfs in the past few months, so unfortunately I have no information about the behavior of any previous kernels.

Joseph Salisbury (jsalisbury) wrote :

You mention you tested v3.11-rc2 in comment #5. Can you also test the latest mainline kernel, which is v3.11-rc6:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.11-rc6-saucy/

fedorowp (fedorowp) wrote :

On an 800 MHz machine I can quite consistently reproduce startup failures and conflicting IPs with ext3 instead of btrfs, using only two containers.

Additional Information:
A brand new install of Ubuntu 13.04 Server AMD64 with all updates applied.

$ sudo lxc-ls --fancy
NAME      STATE    IPV4      IPV6  AUTOSTART
------------------------------------------------
content   RUNNING  10.0.3.2  -     YES
database  RUNNING  10.0.3.2  -     YES

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 13.04
Release: 13.04
Codename: raring

$ uname -a
Linux server 3.8.0-29-generic #42-Ubuntu SMP Tue Aug 13 19:40:39 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

$ lxc-version
lxc version: 0.9.0
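
The conflicting address in the `lxc-ls --fancy` listing above (both containers report 10.0.3.2) could be detected automatically with a sketch along these lines. It assumes the same column order as that listing and a single address in the IPV4 column. `find_duplicate_ips` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: report IPv4 addresses held by more than one RUNNING container.
# Assumption: input lines look like "NAME STATE IPV4 IPV6 AUTOSTART".

find_duplicate_ips() {
    awk '
        # Collect container names per address, skipping containers with no address.
        $2 == "RUNNING" && $3 != "-" { count[$3]++; names[$3] = names[$3] " " $1 }
        # Print only the addresses that appear more than once.
        END { for (ip in count) if (count[ip] > 1) print ip ":" names[ip] }
    '
}

# Example usage (requires lxc, so it is left commented out):
#   sudo lxc-ls --fancy | find_duplicate_ips
```

Each output line names one duplicated address followed by the containers that share it.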

Serge Hallyn (serge-hallyn) wrote :

Please open a new bug for your issue, using 'ubuntu-bug lxc' (though actually I am suspecting a dnsmasq bug). It may be the same issue as this bug, but many variables are different.
