Server crashes on soft lockup

Bug #1679625 reported by t2d
This bug affects 1 person
Affects: linux-lts-xenial (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Release: Ubuntu 14.04.5 LTS
Kernel: Linux 4.4.0-67-generic #88~14.04.1-Ubuntu SMP
Filesystems: ext4 on Hardware RAID 6

We regularly run a backup script that mainly uses rsync and mv. When there is a lot of change, the server sometimes freezes and can only be recovered by power cycling. I thought it was a hardware problem, but we now have this problem on 2 out of 18 identical machines, and they have different BIOS versions. So it is probably related to the amount of data. During the process I see high load from the rsync and chmod processes.

Kernel messages:
Apr 2 01:09:58 server kernel: [483707.688686] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [kswapd0:83]
Apr 2 01:09:58 server kernel: [483707.688716] Modules linked in: drbg ansi_cprng ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables 8021q garp mrp bridge stp llc dm_crypt intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass ipmi_devintf crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac dcdbas edac_core acpi_power_meter shpchp ipmi_si mei_me input_leds lpc_ich ipmi_msghandler mei 8250_fintek mac_hid parport_pc ppdev lp parport igb dca ptp hid_generic usbhid hid ahci pps_core libahci i2c_algo_bit megaraid_sas wmi fjes
Apr 2 01:09:58 server kernel: [483707.688718] CPU: 7 PID: 83 Comm: kswapd0 Tainted: G L 4.4.0-67-generic #88~14.04.1-Ubuntu
Apr 2 01:09:58 server kernel: [483707.688719] Hardware name: Dell Inc. PowerEdge T630, BIOS 1.5.4 10/04/2015
Apr 2 01:09:58 server kernel: [483707.688720] task: ffff881034ac6200 ti: ffff88102da44000 task.ti: ffff88102da44000
Apr 2 01:09:58 server kernel: [483707.688722] RIP: 0010:[<ffffffff810c671a>] [<ffffffff810c671a>] native_queued_spin_lock_slowpath+0x10a/0x170
Apr 2 01:09:58 server kernel: [483707.688723] RSP: 0018:ffff88102da47c58 EFLAGS: 00000246
Apr 2 01:09:58 server kernel: [483707.688724] RAX: 0000000000000000 RBX: 000000000000037a RCX: ffff88103d3d7940
Apr 2 01:09:58 server kernel: [483707.688725] RDX: ffff88103d417940 RSI: 0000000000200000 RDI: ffffffff821dc7e0
Apr 2 01:09:58 server kernel: [483707.688725] RBP: ffff88102da47c58 R08: 0000000000000101 R09: 28f5c28f5c28f5c3
Apr 2 01:09:58 server kernel: [483707.688726] R10: 0000000000000000 R11: ffff88102da47a58 R12: 0000000000000080
Apr 2 01:09:58 server kernel: [483707.688727] R13: 0000000000000000 R14: ffffffff81e8ae40 R15: 0000000000007ace
Apr 2 01:09:58 server kernel: [483707.688728] FS: 0000000000000000(0000) GS:ffff88103d3c0000(0000) knlGS:0000000000000000
Apr 2 01:09:58 server kernel: [483707.688728] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 2 01:09:58 server kernel: [483707.688729] CR2: 00007ff3c624c0f2 CR3: 0000000001e0c000 CR4: 00000000001426e0
Apr 2 01:09:58 server kernel: [483707.688730] Stack:
Apr 2 01:09:58 server kernel: [483707.688731] ffff88102da47c68 ffffffff81183477 ffff88102da47c78 ffffffff81806af0
Apr 2 01:09:58 server kernel: [483707.688733] ffff88102da47c88 ffffffff8125dfd5 ffff88102da47d60 ffffffff8119601a
Apr 2 01:09:58 server kernel: [483707.688734] 0000000000000000 0000000000000000 ffff880da9fdf340 0000000000e86866
Apr 2 01:09:58 server kernel: [483707.688735] Call Trace:
Apr 2 01:09:58 server kernel: [483707.688737] [<ffffffff81183477>] queued_spin_lock_slowpath+0xb/0xf
Apr 2 01:09:58 server kernel: [483707.688739] [<ffffffff81806af0>] _raw_spin_lock+0x20/0x30
Apr 2 01:09:58 server kernel: [483707.688740] [<ffffffff8125dfd5>] mb_cache_shrink_count+0x15/0xb0
Apr 2 01:09:58 server kernel: [483707.688742] [<ffffffff8119601a>] shrink_slab.part.40+0x10a/0x3f0
Apr 2 01:09:58 server kernel: [483707.688744] [<ffffffff8119a6f7>] shrink_zone+0x2a7/0x2c0
Apr 2 01:09:58 server kernel: [483707.688746] [<ffffffff8119b6c7>] kswapd+0x4c7/0x970
Apr 2 01:09:58 server kernel: [483707.688749] [<ffffffff8119b200>] ? mem_cgroup_shrink_node_zone+0x190/0x190
Apr 2 01:09:58 server kernel: [483707.688750] [<ffffffff8109cd19>] kthread+0xc9/0xe0
Apr 2 01:09:58 server kernel: [483707.688752] [<ffffffff8109cc50>] ? kthread_park+0x60/0x60
Apr 2 01:09:58 server kernel: [483707.688753] [<ffffffff8180724f>] ret_from_fork+0x3f/0x70
Apr 2 01:09:58 server kernel: [483707.688754] [<ffffffff8109cc50>] ? kthread_park+0x60/0x60
Apr 2 01:09:58 server kernel: [483707.688772] Code: c2 c1 e8 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 79 01 00 48 03 14 c5 00 99 f3 81 48 89 0a 8b 41 08 85 c0 75 0d f3 90 <8b> 41 08 85 c0 74 f7 eb 02 f3 90 8b 17 66 85 d2 75 f7 39 f2 66
Apr 2 01:09:58 server kernel: [483707.698419] Modules linked in: drbg ansi_cprng ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables 8021q garp mrp bridge stp llc dm_crypt intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass ipmi_devintf crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac dcdbas edac_core acpi_power_meter shpchp ipmi_si mei_me input_leds lpc_ich ipmi_msghandler mei 8250_fintek mac_hid parport_pc ppdev lp parport igb dca ptp hid_generic usbhid hid ahci pps_core libahci i2c_algo_bit megaraid_sas wmi fjes
Apr 2 01:09:58 server kernel: [483707.698441] CPU: 3 PID: 3119 Comm: freshclam Tainted: G L 4.4.0-67-generic #88~14.04.1-Ubuntu
Apr 2 01:09:58 server kernel: [483707.698441] Hardware name: Dell Inc. PowerEdge T630, BIOS 1.5.4 10/04/2015
Apr 2 01:09:58 server kernel: [483707.698443] task: ffff88102b9b3800 ti: ffff88102ef28000 task.ti: ffff88102ef28000
Apr 2 01:09:58 server kernel: [483707.698444] RIP: 0010:[<ffffffff810c671d>] [<ffffffff810c671d>] native_queued_spin_lock_slowpath+0x10d/0x170
Apr 2 01:09:58 server kernel: [483707.698447] RSP: 0018:ffff88102ef2b7c0 EFLAGS: 00000246
Apr 2 01:09:58 server kernel: [483707.698448] RAX: 0000000000000000 RBX: 000000000000037a RCX: ffff88103d2d7940
Apr 2 01:09:58 server kernel: [483707.698448] RDX: ffff88103d3d7940 RSI: 0000000000100000 RDI: ffffffff821dc7e0
Apr 2 01:09:58 server kernel: [483707.698449] RBP: ffff88102ef2b7c0 R08: 0000000000000101 R09: 28f5c28f5c28f5c3
Apr 2 01:09:58 server kernel: [483707.698450] R10: 0000000000000000 R11: ffff88102ef2b5c8 R12: 0000000000000080
Apr 2 01:09:58 server kernel: [483707.698451] R13: 0000000000000000 R14: ffffffff81e8ae40 R15: 0000000000007ace
Apr 2 01:09:58 server kernel: [483707.698452] FS: 00007fe59bc02780(0000) GS:ffff88103d2c0000(0000) knlGS:0000000000000000
Apr 2 01:09:58 server kernel: [483707.698453] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 2 01:09:58 server kernel: [483707.698454] CR2: 00007fe59bc13000 CR3: 000000102c83f000 CR4: 00000000001426e0
Apr 2 01:09:58 server kernel: [483707.698455] Stack:
Apr 2 01:09:58 server kernel: [483707.698456] ffff88102ef2b7d0 ffffffff81183477 ffff88102ef2b7e0 ffffffff81806af0
Apr 2 01:09:58 server kernel: [483707.698457] ffff88102ef2b7f0 ffffffff8125dfd5 ffff88102ef2b8c8 ffffffff8119601a
Apr 2 01:09:58 server kernel: [483707.698459] 0000000000000003 0000000000000001 0000000000000000 0000000000e876d8
Apr 2 01:09:58 server kernel: [483707.698461] Call Trace:
Apr 2 01:09:58 server kernel: [483707.698463] [<ffffffff81183477>] queued_spin_lock_slowpath+0xb/0xf
Apr 2 01:09:58 server kernel: [483707.698465] [<ffffffff81806af0>] _raw_spin_lock+0x20/0x30
Apr 2 01:09:58 server kernel: [483707.698467] [<ffffffff8125dfd5>] mb_cache_shrink_count+0x15/0xb0
Apr 2 01:09:58 server kernel: [483707.698469] [<ffffffff8119601a>] shrink_slab.part.40+0x10a/0x3f0
Apr 2 01:09:58 server kernel: [483707.698471] [<ffffffff8119a6f7>] shrink_zone+0x2a7/0x2c0
Apr 2 01:09:58 server kernel: [483707.698473] [<ffffffff8119aa86>] do_try_to_free_pages+0x166/0x3d0
Apr 2 01:09:58 server kernel: [483707.698475] [<ffffffff81197dfd>] ? throttle_direct_reclaim+0x8d/0x230
Apr 2 01:09:58 server kernel: [483707.698477] [<ffffffff8119ada5>] try_to_free_pages+0xb5/0x170
Apr 2 01:09:58 server kernel: [483707.698479] [<ffffffff811fbb6e>] __alloc_pages_slowpath.constprop.87+0x323/0x78c
Apr 2 01:09:58 server kernel: [483707.698482] [<ffffffff8118e3c7>] __alloc_pages_nodemask+0x237/0x240
Apr 2 01:09:58 server kernel: [483707.698483] [<ffffffff811d4298>] alloc_pages_current+0x88/0x120
Apr 2 01:09:58 server kernel: [483707.698485] [<ffffffff8118562e>] __page_cache_alloc+0xae/0xc0
Apr 2 01:09:58 server kernel: [483707.698487] [<ffffffff81186029>] pagecache_get_page+0x59/0x1c0
Apr 2 01:09:58 server kernel: [483707.698488] [<ffffffff811861b6>] grab_cache_page_write_begin+0x26/0x40
Apr 2 01:09:58 server kernel: [483707.698490] [<ffffffff8128e6d1>] ext4_da_write_begin+0xa1/0x330
Apr 2 01:09:58 server kernel: [483707.698492] [<ffffffff811851f0>] generic_perform_write+0xc0/0x1a0
Apr 2 01:09:58 server kernel: [483707.698494] [<ffffffff8121a89b>] ? file_update_time+0x3b/0xf0
Apr 2 01:09:58 server kernel: [483707.698496] [<ffffffff811873a7>] __generic_file_write_iter+0x197/0x1e0
Apr 2 01:09:58 server kernel: [483707.698498] [<ffffffff812832e6>] ext4_file_write_iter+0xf6/0x360
Apr 2 01:09:58 server kernel: [483707.698500] [<ffffffff812008f8>] new_sync_write+0x88/0xb0
Apr 2 01:09:58 server kernel: [483707.698501] [<ffffffff81200947>] __vfs_write+0x27/0x40
Apr 2 01:09:58 server kernel: [483707.698503] [<ffffffff81200f52>] vfs_write+0xa2/0x1a0
Apr 2 01:09:58 server kernel: [483707.698504] [<ffffffff81201c76>] SyS_write+0x46/0xa0
Apr 2 01:09:58 server kernel: [483707.698506] [<ffffffff81806eb6>] entry_SYSCALL_64_fastpath+0x16/0x75
Apr 2 01:09:58 server kernel: [483707.698507] Code: 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 79 01 00 48 03 14 c5 00 99 f3 81 48 89 0a 8b 41 08 85 c0 75 0d f3 90 8b 41 08 <85> c0 74 f7 eb 02 f3 90 8b 17 66 85 d2 75 f7 39 f2 66 90 75 0f

The problem has existed for a while now. None of the latest kernel updates helped. Can you please advise me what to do? Thank you!

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: linux-image-4.4.0-67-generic 4.4.0-67.88~14.04.1
ProcVersionSignature: Ubuntu 4.4.0-67.88~14.04.1-generic 4.4.49
Uname: Linux 4.4.0-67-generic x86_64
ApportVersion: 2.14.1-0ubuntu3.23
Architecture: amd64
Date: Tue Apr 4 12:38:13 2017
InstallationDate: Installed on 2016-02-22 (406 days ago)
InstallationMedia: Ubuntu-Server 14.04 LTS "Trusty Tahr" - Release amd64 (20140416.2)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-lts-xenial
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
t2d (t2d) wrote :

We ran all tests with Dell support and they tell us there is definitely no hardware problem. Any ideas on how to proceed? Thanks

Revision history for this message
t2d (t2d) wrote :

The error was triggered by a subcommand of the backup script.

> rsync -avAX --delete --delete-during --numeric-ids /var/backup/daily.0/ /var/backup/weekly.0

Somehow ext4 and rsync didn't like the Windows attributes of some of the files. Once I dropped -A and -X (which preserve ACLs and extended attributes) and only ran "rsync -av", the problem was gone.
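
To check whether the lockups really stop after changing the rsync flags, the kernel log can be scanned for the watchdog message quoted in the bug description. This is only a minimal sketch: the function name `find_soft_lockups` is made up for illustration, and it assumes the log is available as plain text lines in the same format as the messages above.

```python
import re

# Matches watchdog lines such as:
# "... NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [kswapd0:83]"
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft-lockup message found in an iterable of log lines."""
    events = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append({
                "cpu": int(match.group("cpu")),
                "seconds": int(match.group("secs")),
                "comm": match.group("comm"),
                "pid": int(match.group("pid")),
            })
    return events


# Two lines copied from the report: one watchdog message, one unrelated line.
sample = [
    "Apr 2 01:09:58 server kernel: [483707.688686] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [kswapd0:83]",
    "Apr 2 01:09:58 server kernel: [483707.688730] Stack:",
]

for event in find_soft_lockups(sample):
    print(event)
```

Running the same function over the log from a backup night before and after the flag change (for example by passing an open log file as `lines`) would show whether any new soft lockups were recorded.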
