Activity log for bug #1680513

Date Who What changed Old value New value Message
2017-04-06 15:44:54 Gavin Guo bug added bug
2017-04-06 15:49:56 Gavin Guo description

After numad is enabled and there are several VMs running on the same host machine (host kernel version: 4.4.0-72-generic #93), softlockup messages can be observed inside the VMs' dmesg.

First, a crash dump was captured when the symptom was observed. At first glance, it looks like a lost-IPI issue. The numad process initiates a migration of memory and, as part of this, needs to flush the TLB cache of another CPU. When the crash dump was taken, that other CPU had the TLB flush pending, but not executed. The numad kernel task is holding a semaphore lock, mmap_sem (for the VM's memory), to do the migration, and the tasks that actually end up being blocked are other virtual CPUs of the same VM. These tasks need to access or make changes to the memory map of the VM because of VM page faults, but cannot acquire the semaphore lock.

However, the original thoughts on the root cause (unhandled IPI or csd lock issue) are incorrect. We originally suspected an issue with a lost IPI (inter-processor interrupt) that performs remote CPU cache flushes during page migration, or a known issue with the "csd" lock used to synchronize the remote CPU cache flush. A lost IPI would be a function of the system firmware or chipset (it is not a CPU issue), but the known csd issue is hardware independent. Gavin created a hotfix kernel with changes in the csd_lock_wait function that would time out if the unlock never happens (the end result of either cause) and print messages to the console when that timeout occurred. The messages look like:

  csd_lock_wait called %d times
  csd: Detected non-responsive CSD lock (#%d) on CPU#%02d, waiting %Ld.%03Ld secs for CPU#%02d

However, the VMs are still experiencing the hangs, yet the csd_lock_wait timeout is not happening. This suggests that the csd lock / lost IPI is not the actual cause.

In the crash dump, the numad task has induced a migration, and the stack is as follows:

  #1 [ffff885f8fb4fb78] smp_call_function_many
  #2 [ffff885f8fb4fbc0] native_flush_tlb_others
  #3 [ffff885f8fb4fc08] flush_tlb_page
  #4 [ffff885f8fb4fc30] ptep_clear_flush
  #5 [ffff885f8fb4fc60] try_to_unmap_one
  #6 [ffff885f8fb4fcd0] rmap_walk_ksm
  #7 [ffff885f8fb4fd28] rmap_walk
  #8 [ffff885f8fb4fd80] try_to_unmap
  #9 [ffff885f8fb4fdc8] migrate_pages
  #10 [ffff885f8fb4fe80] do_migrate_pages

Frame #1 is actually in the csd_lock_wait function mentioned above, but the compiler has optimized that call away, so it does not appear in the stack. What happens here is that do_migrate_pages (frame #10) acquires the semaphore that everything else is waiting for (and that eventually produces the hang warnings), and it holds that semaphore for the duration of the page migration. This strongly suggests that this single do_migrate_pages call is taking in excess of 10 seconds; if the csd lock is not stuck, then something else within its call path is not functioning correctly. We originally suspected that the lost IPI / csd lock hang was responsible for the hung task timeouts, but in the absence of the csd warning messages, the cause presumably lies elsewhere. A KSM function appears in frame #6; this is the function that searches out the merged pages to handle them for the migration.

Gavin disassembled the code and found that the stable_node->hlist is 2306920 entries long:

  rmap_item list (stable_node->hlist):
  stable_node: 0xffff881f836ba000
  stable_node->hlist->first = 0xffff883f3e5746b0

  struct hlist_head {
    [0] struct hlist_node *first;
  }
  struct hlist_node {
    [0] struct hlist_node *next;
    [8] struct hlist_node **pprev;
  }

  crash> list hlist_node.next 0xffff883f3e5746b0 > rmap_item.lst
  $ wc -l rmap_item.lst
  2306920 rmap_item.lst

This is roughly 9 GB of pages. The theory is that KSM has merged a very large number of pages that are empty (the value of every location in the page is zero).

Andrea Arcangeli already sent out the patch[1] on 2015/11/10, and Andrew Morton said he would apply it. However, the patch eventually disappeared from the mmotm tree in April 2016.

[1]. [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
     http://www.spinics.net/lists/linux-mm/msg96866.html
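The degree of KSM sharing described above can also be gauged without a crash dump. A minimal sketch, assuming the standard KSM sysfs counters under /sys/kernel/mm/ksm; the ratio is only a host-wide average, so it is a rough indicator rather than a per-stable-node measurement like the 2306920-entry hlist found here:

  #!/bin/sh
  # Sketch: estimate how concentrated KSM sharing is on this host.
  # pages_shared  = number of KSM stable pages in use
  # pages_sharing = number of additional sites sharing those pages
  shared=$(cat /sys/kernel/mm/ksm/pages_shared)
  sharing=$(cat /sys/kernel/mm/ksm/pages_sharing)
  echo "pages_shared=$shared pages_sharing=$sharing"
  if [ "$shared" -gt 0 ]; then
      # Average sharers per stable page; values in the millions would be
      # consistent with the pathological chain seen in the crash dump.
      echo "average sharing per stable page: $((sharing / shared))"
  fi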
2017-04-06 15:54:01 Joseph Salisbury tags sts kernel-da-key sts
2017-04-06 15:54:08 Joseph Salisbury linux (Ubuntu): importance Undecided Medium
2017-04-06 15:55:01 Gavin Guo linux (Ubuntu): assignee Gavin Guo (mimi0213kimo)
2017-04-06 16:00:14 Brad Figg linux (Ubuntu): status New Incomplete
2017-04-14 03:10:57 Gavin Guo description

[Impact]

After numad is enabled and there are several VMs running on the same host machine (host kernel version: 4.4.0-72-generic #93), softlockup messages can be observed inside the VMs' dmesg.

First, a crash dump was captured when the symptom was observed. At first glance, it looks like a lost-IPI issue. The numad process initiates a migration of memory and, as part of this, needs to flush the TLB cache of another CPU. When the crash dump was taken, that other CPU had the TLB flush pending, but not executed. The numad kernel task is holding a semaphore lock, mmap_sem (for the VM's memory), to do the migration, and the tasks that actually end up being blocked are other virtual CPUs of the same VM. These tasks need to access or make changes to the memory map of the VM because of VM page faults, but cannot acquire the semaphore lock.

However, the original thoughts on the root cause (unhandled IPI or csd lock issue) are incorrect. We originally suspected an issue with a lost IPI (inter-processor interrupt) that performs remote CPU cache flushes during page migration, or a known issue with the "csd" lock used to synchronize the remote CPU cache flush. A lost IPI would be a function of the system firmware or chipset (it is not a CPU issue), but the known csd issue is hardware independent. Gavin created a hotfix kernel with changes in the csd_lock_wait function that would time out if the unlock never happens (the end result of either cause) and print messages to the console when that timeout occurred. The messages look like:

  csd_lock_wait called %d times
  csd: Detected non-responsive CSD lock (#%d) on CPU#%02d, waiting %Ld.%03Ld secs for CPU#%02d

However, the VMs are still experiencing the hangs, yet the csd_lock_wait timeout is not happening. This suggests that the csd lock / lost IPI is not the actual cause.

In the crash dump, the numad task has induced a migration, and the stack is as follows:

  #1 [ffff885f8fb4fb78] smp_call_function_many
  #2 [ffff885f8fb4fbc0] native_flush_tlb_others
  #3 [ffff885f8fb4fc08] flush_tlb_page
  #4 [ffff885f8fb4fc30] ptep_clear_flush
  #5 [ffff885f8fb4fc60] try_to_unmap_one
  #6 [ffff885f8fb4fcd0] rmap_walk_ksm
  #7 [ffff885f8fb4fd28] rmap_walk
  #8 [ffff885f8fb4fd80] try_to_unmap
  #9 [ffff885f8fb4fdc8] migrate_pages
  #10 [ffff885f8fb4fe80] do_migrate_pages

Frame #1 is actually in the csd_lock_wait function mentioned above, but the compiler has optimized that call away, so it does not appear in the stack. What happens here is that do_migrate_pages (frame #10) acquires the semaphore that everything else is waiting for (and that eventually produces the hang warnings), and it holds that semaphore for the duration of the page migration. This strongly suggests that this single do_migrate_pages call is taking in excess of 10 seconds; if the csd lock is not stuck, then something else within its call path is not functioning correctly. We originally suspected that the lost IPI / csd lock hang was responsible for the hung task timeouts, but in the absence of the csd warning messages, the cause presumably lies elsewhere. A KSM function appears in frame #6; this is the function that searches out the merged pages to handle them for the migration.

Gavin disassembled the code and found that the stable_node->hlist is 2306920 entries long:

  rmap_item list (stable_node->hlist):
  stable_node: 0xffff881f836ba000
  stable_node->hlist->first = 0xffff883f3e5746b0

  struct hlist_head {
    [0] struct hlist_node *first;
  }
  struct hlist_node {
    [0] struct hlist_node *next;
    [8] struct hlist_node **pprev;
  }

  crash> list hlist_node.next 0xffff883f3e5746b0 > rmap_item.lst
  $ wc -l rmap_item.lst
  2306920 rmap_item.lst

This is roughly 9 GB of pages. The theory is that KSM has merged a very large number of pages that are empty (the value of every location in the page is zero).

[Fix]

Andrea Arcangeli already sent out the patch[1] on 2015/11/10, and Andrew Morton said he would apply it. However, the patch eventually disappeared from the mmotm tree in April 2016.

[1]. [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
     http://www.spinics.net/lists/linux-mm/msg96866.html

[Test Case]

The patch has been tested.
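For reference, the hotfix kernel's csd timeout messages described in the entries above could be watched for from user space; this is only a hypothetical monitoring snippet, not part of the hotfix itself, and the grep pattern simply reuses the message text quoted in the description:

  #!/bin/sh
  # Follow the kernel log and flag the csd_lock_wait timeout messages.
  # If the guest hangs recur without these lines ever appearing, the
  # csd lock / lost-IPI theory is ruled out, as concluded above.
  dmesg --follow | grep -E 'csd_lock_wait called|Detected non-responsive CSD lock'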
2017-04-14 11:57:36 Gavin Guo description

[Impact]

After numad is enabled and there are several VMs running on the same host machine (host kernel version: 4.4.0-72-generic #93), softlockup messages can be observed inside the VMs' dmesg.

First, a crash dump was captured when the symptom was observed. At first glance, it looks like a lost-IPI issue. The numad process initiates a migration of memory and, as part of this, needs to flush the TLB cache of another CPU. When the crash dump was taken, that other CPU had the TLB flush pending, but not executed. The numad kernel task is holding a semaphore lock, mmap_sem (for the VM's memory), to do the migration, and the tasks that actually end up being blocked are other virtual CPUs of the same VM. These tasks need to access or make changes to the memory map of the VM because of VM page faults, but cannot acquire the semaphore lock.

However, the original thoughts on the root cause (unhandled IPI or csd lock issue) are incorrect. We originally suspected an issue with a lost IPI (inter-processor interrupt) that performs remote CPU cache flushes during page migration, or a known issue with the "csd" lock used to synchronize the remote CPU cache flush. A lost IPI would be a function of the system firmware or chipset (it is not a CPU issue), but the known csd issue is hardware independent. Gavin created a hotfix kernel with changes in the csd_lock_wait function that would time out if the unlock never happens (the end result of either cause) and print messages to the console when that timeout occurred. The messages look like:

  csd_lock_wait called %d times
  csd: Detected non-responsive CSD lock (#%d) on CPU#%02d, waiting %Ld.%03Ld secs for CPU#%02d

However, the VMs are still experiencing the hangs, yet the csd_lock_wait timeout is not happening. This suggests that the csd lock / lost IPI is not the actual cause.

In the crash dump, the numad task has induced a migration, and the stack is as follows:

  #1 [ffff885f8fb4fb78] smp_call_function_many
  #2 [ffff885f8fb4fbc0] native_flush_tlb_others
  #3 [ffff885f8fb4fc08] flush_tlb_page
  #4 [ffff885f8fb4fc30] ptep_clear_flush
  #5 [ffff885f8fb4fc60] try_to_unmap_one
  #6 [ffff885f8fb4fcd0] rmap_walk_ksm
  #7 [ffff885f8fb4fd28] rmap_walk
  #8 [ffff885f8fb4fd80] try_to_unmap
  #9 [ffff885f8fb4fdc8] migrate_pages
  #10 [ffff885f8fb4fe80] do_migrate_pages

Frame #1 is actually in the csd_lock_wait function mentioned above, but the compiler has optimized that call away, so it does not appear in the stack. What happens here is that do_migrate_pages (frame #10) acquires the semaphore that everything else is waiting for (and that eventually produces the hang warnings), and it holds that semaphore for the duration of the page migration. This strongly suggests that this single do_migrate_pages call is taking in excess of 10 seconds; if the csd lock is not stuck, then something else within its call path is not functioning correctly. We originally suspected that the lost IPI / csd lock hang was responsible for the hung task timeouts, but in the absence of the csd warning messages, the cause presumably lies elsewhere. A KSM function appears in frame #6; this is the function that searches out the merged pages to handle them for the migration.

Gavin disassembled the code and found that the stable_node->hlist is 2306920 entries long:

  rmap_item list (stable_node->hlist):
  stable_node: 0xffff881f836ba000
  stable_node->hlist->first = 0xffff883f3e5746b0

  struct hlist_head {
    [0] struct hlist_node *first;
  }
  struct hlist_node {
    [0] struct hlist_node *next;
    [8] struct hlist_node **pprev;
  }

  crash> list hlist_node.next 0xffff883f3e5746b0 > rmap_item.lst
  $ wc -l rmap_item.lst
  2306920 rmap_item.lst

This is roughly 9 GB of pages. The theory is that KSM has merged a very large number of pages that are empty (the value of every location in the page is zero).

The bug can be observed in the perf flame graph[1]:

[1]. http://kernel.ubuntu.com/~gavinguo/sf00131845/numa-131845.svg

[Fix]

Andrea Arcangeli already sent out the patch[2] on 2015/11/10, and Andrew Morton said he would apply it. However, the patch eventually disappeared from the mmotm tree in April 2016.

[2]. [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
     http://www.spinics.net/lists/linux-mm/msg96866.html

[Test Case]

The patch has been tested with 9 VMs, each with 32 GB of RAM and 16 CPUs. numad/KSM are also enabled on the machine. After running for almost 2 days, the system is stable and unstable CPU loading cannot be observed inside the virtual appliances monitor[3]. The numad CPU utilization rate is normal and guest hangs also cannot be observed.

Machine type: Dell PowerEdge R920
Memory: 528 GB with 4 NUMA nodes
CPU: 120 cores
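A sketch of how a flame graph like the one referenced as [1] above could be captured on the host; it assumes Brendan Gregg's FlameGraph scripts (stackcollapse-perf.pl, flamegraph.pl) are available in the current directory, and the sampling rate and duration are arbitrary:

  #!/bin/sh
  # Sample kernel and user stacks system-wide for 60 seconds, then fold
  # the stacks into an SVG flame graph. Wide plateaus in rmap_walk_ksm /
  # try_to_unmap under do_migrate_pages correspond to the long page
  # migrations described in this bug.
  perf record -F 99 -a -g -- sleep 60
  perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > numad-ksm.svg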
2017-04-18 11:51:30 Gavin Guo description

[Impact]

After numad is enabled and there are several VMs running on the same host machine (host kernel version: 4.4.0-72-generic #93), softlockup messages can be observed inside the VMs' dmesg.

First, a crash dump was captured when the symptom was observed. At first glance, it looks like a lost-IPI issue. The numad process initiates a migration of memory and, as part of this, needs to flush the TLB cache of another CPU. When the crash dump was taken, that other CPU had the TLB flush pending, but not executed. The numad kernel task is holding a semaphore lock, mmap_sem (for the VM's memory), to do the migration, and the tasks that actually end up being blocked are other virtual CPUs of the same VM. These tasks need to access or make changes to the memory map of the VM because of VM page faults, but cannot acquire the semaphore lock.

However, the original thoughts on the root cause (unhandled IPI or csd lock issue) are incorrect. We originally suspected an issue with a lost IPI (inter-processor interrupt) that performs remote CPU cache flushes during page migration, or a known issue with the "csd" lock used to synchronize the remote CPU cache flush. A lost IPI would be a function of the system firmware or chipset (it is not a CPU issue), but the known csd issue is hardware independent. Gavin created a hotfix kernel with changes in the csd_lock_wait function that would time out if the unlock never happens (the end result of either cause) and print messages to the console when that timeout occurred. The messages look like:

  csd_lock_wait called %d times
  csd: Detected non-responsive CSD lock (#%d) on CPU#%02d, waiting %Ld.%03Ld secs for CPU#%02d

However, the VMs are still experiencing the hangs, yet the csd_lock_wait timeout is not happening. This suggests that the csd lock / lost IPI is not the actual cause.

In the crash dump, the numad task has induced a migration, and the stack is as follows:

  #1 [ffff885f8fb4fb78] smp_call_function_many
  #2 [ffff885f8fb4fbc0] native_flush_tlb_others
  #3 [ffff885f8fb4fc08] flush_tlb_page
  #4 [ffff885f8fb4fc30] ptep_clear_flush
  #5 [ffff885f8fb4fc60] try_to_unmap_one
  #6 [ffff885f8fb4fcd0] rmap_walk_ksm
  #7 [ffff885f8fb4fd28] rmap_walk
  #8 [ffff885f8fb4fd80] try_to_unmap
  #9 [ffff885f8fb4fdc8] migrate_pages
  #10 [ffff885f8fb4fe80] do_migrate_pages

Frame #1 is actually in the csd_lock_wait function mentioned above, but the compiler has optimized that call away, so it does not appear in the stack. What happens here is that do_migrate_pages (frame #10) acquires the semaphore that everything else is waiting for (and that eventually produces the hang warnings), and it holds that semaphore for the duration of the page migration. This strongly suggests that this single do_migrate_pages call is taking in excess of 10 seconds; if the csd lock is not stuck, then something else within its call path is not functioning correctly. We originally suspected that the lost IPI / csd lock hang was responsible for the hung task timeouts, but in the absence of the csd warning messages, the cause presumably lies elsewhere. A KSM function appears in frame #6; this is the function that searches out the merged pages to handle them for the migration.

Gavin disassembled the code and found that the stable_node->hlist is 2306920 entries long:

  rmap_item list (stable_node->hlist):
  stable_node: 0xffff881f836ba000
  stable_node->hlist->first = 0xffff883f3e5746b0

  struct hlist_head {
    [0] struct hlist_node *first;
  }
  struct hlist_node {
    [0] struct hlist_node *next;
    [8] struct hlist_node **pprev;
  }

  crash> list hlist_node.next 0xffff883f3e5746b0 > rmap_item.lst
  $ wc -l rmap_item.lst
  2306920 rmap_item.lst

This is roughly 9 GB of pages. The theory is that KSM has merged a very large number of pages that are empty (the value of every location in the page is zero).

The bug can be observed in the perf flame graph[1]:

[1]. http://kernel.ubuntu.com/~gavinguo/sf00131845/numa-131845.svg

[Fix]

Andrea Arcangeli already sent out the patch[2] on 2015/11/10, and Andrew Morton said he would apply it. However, the patch eventually disappeared from the mmotm tree in April 2016. Andrea suggested applying the 3 patches[3].

[2]. [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
     http://www.spinics.net/lists/linux-mm/msg96866.html

[3]. Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
     https://www.spinics.net/lists/linux-mm/msg113829.html

[Test Case]

The patches have been tested with 9 VMs, each with 32 GB of RAM and 16 vCPUs. numad/KSM are also enabled on the machine. After running for 6 days, the system is stable and unstable CPU loading cannot be observed inside the virtual appliances monitor[4]. The numad CPU utilization rate is normal and guest hangs also cannot be observed.

Machine type: Dell PowerEdge R920
Memory: 528 GB with 4 NUMA nodes
CPU: 120 cores

[4]. http://kernel.ubuntu.com/~gavinguo/sf00131845/virtual_appliances_loading.png
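On a kernel carrying the ksm_max_page_sharing patches referenced above, the new per-page deduplication limit is exposed through sysfs. A minimal sketch of inspecting and tuning it; the knob only exists with the patches applied, the value written here is illustrative, and the allowed range and the rules for when it may be changed are described in the patch itself:

  #!/bin/sh
  # max_page_sharing caps how many rmap_items may hang off a single KSM
  # stable node, bounding the rmap_walk_ksm time that produced the soft
  # lockups in this bug. Only present with the ksm_max_page_sharing patch.
  cat /sys/kernel/mm/ksm/max_page_sharing
  # Illustrative value; see the patch description for constraints.
  echo 256 > /sys/kernel/mm/ksm/max_page_sharing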
2017-07-21 08:30:24 Stefan Bader nominated for series Ubuntu Zesty
2017-07-21 08:30:24 Stefan Bader bug task added linux (Ubuntu Zesty)
2017-07-21 08:30:24 Stefan Bader nominated for series Ubuntu Xenial
2017-07-21 08:30:24 Stefan Bader bug task added linux (Ubuntu Xenial)
2017-07-24 13:53:25 Seth Forshee linux (Ubuntu): status Incomplete Fix Committed
2017-08-07 11:34:33 Kleber Sacilotto de Souza linux (Ubuntu Zesty): status New Fix Committed
2017-08-10 05:56:17 Launchpad Janitor linux (Ubuntu): status Fix Committed Fix Released
2017-08-10 05:56:17 Launchpad Janitor cve linked 2017-1000364
2017-08-10 05:56:17 Launchpad Janitor cve linked 2017-10810
2017-08-10 05:56:17 Launchpad Janitor cve linked 2017-7533
2017-08-16 16:34:03 Kleber Sacilotto de Souza tags kernel-da-key sts kernel-da-key sts verification-needed-zesty
2017-08-18 09:59:43 Gavin Guo tags kernel-da-key sts verification-needed-zesty kernel-da-key sts verification-done-zesty
2017-08-23 15:20:01 Kleber Sacilotto de Souza linux (Ubuntu Xenial): status New Fix Committed
2017-08-28 10:14:27 Launchpad Janitor linux (Ubuntu Zesty): status Fix Committed Fix Released
2017-08-28 10:14:27 Launchpad Janitor cve linked 2017-1000111
2017-08-28 10:14:27 Launchpad Janitor cve linked 2017-1000112
2017-08-28 10:14:27 Launchpad Janitor cve linked 2017-7487
2017-09-01 08:28:02 Kleber Sacilotto de Souza tags kernel-da-key sts verification-done-zesty kernel-da-key sts verification-done-zesty verification-needed-xenial
2017-09-06 09:52:17 Gavin Guo tags kernel-da-key sts verification-done-zesty verification-needed-xenial kernel-da-key sts verification-done-xenial verification-done-zesty
2017-09-18 10:11:08 Launchpad Janitor linux (Ubuntu Xenial): status Fix Committed Fix Released
2017-09-18 10:11:08 Launchpad Janitor cve linked 2017-1000251