wrong stack size calculation
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MySQL Server | Unknown | Unknown | |
Percona Server moved to https://jira.percona.com/projects/PS | Fix Released | High | Vlad Lesin |
5.5 | Fix Released | High | Vlad Lesin |
5.6 | Fix Released | High | Vlad Lesin |
Bug Description
One of our users provided the following analysis:
gdb tells us that the core was generated by a SIGSEGV, with the following stack trace:
(gdb) where
#0 msort_with_tmp (p=0x7ef5064122e0, b=0x7ef506412388, n=2) at msort.c:41
#1 0x00007f2b6dbf25b7 in msort_with_tmp (n=2, b=0x7ef506412388, p=0x7ef5064122e0) at msort.c:46
#2 msort_with_tmp (p=0x7ef5064122e0, b=0x7ef506412380, n=3) at msort.c:55
#3 0x00007f2b6dbf25b7 in msort_with_tmp (n=3, b=0x7ef506412380, p=0x7ef5064122e0) at msort.c:46
#4 msort_with_tmp (p=0x7ef5064122e0, b=0x7ef506412370, n=5) at msort.c:55
#5 0x00007f2b6dbf25b7 in msort_with_tmp (n=5, b=0x7ef506412370, p=0x7ef5064122e0) at msort.c:46
#6 msort_with_tmp (p=0x7ef5064122e0, b=0x7ef506412350, n=9) at msort.c:55
#7 0x00007f2b6dbf25a1 in msort_with_tmp (n=9, b=0x7ef506412350, p=0x7ef5064122e0) at msort.c:46
#8 msort_with_tmp (p=0x7ef5064122e0, b=0x7ef506412350, n=18) at msort.c:54
#9 0x00007f2b6dbf2abb in msort_with_tmp (p=0x7ef5064122e0, n=18, b=0x7ef506412350) at msort.c:46
#10 qsort_r (b=0x7ef506412350, n=18, s=<optimized out>, cmp=0x7f2b707eee10, arg=0x0) at msort.c:298
#11 0x00007f2b707ef201 in ?? ()
#12 0x00007f2b707f0152 in ?? ()
#13 0x00007f2b707f0211 in ?? ()
#14 0x00007f2b707f066a in lf_hash_insert ()
The first thing that struck us as odd was that it was crashing in qsort, which is part of glibc. Diving into the assembly, we can see which instruction generated the core:
(gdb) disas
Dump of assembler code for function msort_with_tmp:
0x00007f2b6dbf2560 <+0>: push %r15
0x00007f2b6dbf2562 <+2>: push %r14
0x00007f2b6dbf2564 <+4>: push %r13
0x00007f2b6dbf2566 <+6>: mov %rdx,%r13
0x00007f2b6dbf2569 <+9>: shr %r13
0x00007f2b6dbf256c <+12>: push %r12
0x00007f2b6dbf256e <+14>: mov %rdx,%r12
0x00007f2b6dbf2571 <+17>: sub %r13,%r12
0x00007f2b6dbf2574 <+20>: push %rbp
0x00007f2b6dbf2575 <+21>: push %rbx
0x00007f2b6dbf2576 <+22>: mov %r13,%rbx
0x00007f2b6dbf2579 <+25>: sub $0x38,%rsp
0x00007f2b6dbf257d <+29>: imul (%rdi),%rbx
=> 0x00007f2b6dbf2581 <+33>: mov %rdi,0x18(%rsp)
0x00007f2b6dbf2586 <+38>: mov %rsi,0x20(%rsp)
0x00007f2b6dbf258b <+43>: mov %rdx,0x28(%rsp)
What this tells us is that the faulting instruction was mov %rdi,0x18(%rsp), which is trying to write the contents of the rdi register to the address 0x18 bytes above the stack pointer. This is where things start to get interesting. The value of rsp is 0x7ef506411fe0:
(gdb) p $rsp
$1 = (void *) 0x7ef506411fe0
This is interesting because the value is near the end of a memory page (it ends with fe0 -- we use 4k pages). Using readelf, we can dump out the memory mapping of the process, which is contained in the core file. What we're looking for is the mapping that contains the address 0x7ef506411fe0:
scott@db4:
...
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
... lots of output ...
LOAD 0x000000037b88a000 0x00007ef506411000 0x0000000000000000
0x0000000000001000 0x0000000000001000 1000
LOAD 0x000000037b88b000 0x00007ef506412000 0x0000000000000000
0x0000000000040000 0x0000000000040000 RW 1000
LOAD 0x000000037b8cb000 0x00007ef506452000 0x0000000000000000
0x0000000000001000 0x0000000000001000 1000
...
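What this tells us is that the mapping that contains 0x7ef506411fe0 is 0x1000 bytes (4k, or one page) and that it is NOT writable (its Flags column is empty). The mapping immediately above it, starting at 0x7ef506412000, is the 0x40000-byte RW region that holds the thread's stack.
This is what caused this particular core. The stack pointer (rsp) is now pointing outside of the stack and into the guard page. When we try to write to the guard page, the kernel does the right thing and delivers a SIGSEGV.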
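The lookup described above can be sketched in C. This is only an illustration, not code from MySQL or glibc: the struct, the table name core_segments, and both helper functions are hypothetical, and the table contains only the three LOAD entries quoted from the readelf output. The sketch computes the address that the faulting mov writes to (rsp + 0x18) and checks which mapping contains it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One LOAD program header, reduced to the fields used here (hypothetical type). */
struct segment {
    uint64_t vaddr;    /* VirtAddr column */
    uint64_t memsz;    /* MemSiz column */
    bool     writable; /* true when the Flags column contains W */
};

/* The three mappings quoted from the readelf output. */
static const struct segment core_segments[] = {
    { 0x00007ef506411000ULL, 0x01000ULL, false }, /* guard page below the stack */
    { 0x00007ef506412000ULL, 0x40000ULL, true  }, /* RW stack mapping */
    { 0x00007ef506452000ULL, 0x01000ULL, false }, /* page above the stack */
};

/* Return the segment containing addr, or NULL if no segment does. */
static const struct segment *find_segment(const struct segment *segs,
                                          size_t count, uint64_t addr)
{
    assert(segs != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (addr >= segs[i].vaddr && addr - segs[i].vaddr < segs[i].memsz)
            return &segs[i];
    }
    return NULL;
}

/* Effective address written by `mov %rdi,0x18(%rsp)` for a given rsp. */
static uint64_t fault_address(uint64_t rsp)
{
    return rsp + 0x18;
}
```

With rsp = 0x7ef506411fe0, fault_address gives 0x7ef506411ff8, and find_segment returns the first entry, whose writable field is false. That matches the conclusion that the write landed in the guard page.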
So, how does this happen? The offending function is _lf_pinbox_
(gdb) up
#12 0x00007f2b707f0152 in lfind (head=<optimized out>, cs=0x7f2b70e9b580, hashnr=3564606781, key=0x7f2b568e05d0 "/data/
pins=0x7ef39584
138 /mnt/workspace/
(gdb) p $rsp
$1 = (void *) 0x7ef50644c690
(gdb) down
#11 0x00007f2b707ef201 in _lf_pinbox_
at /mnt/workspace/
359 /mnt/workspace/
(gdb) p $rsp
$2 = (void *) 0x7ef506412350
(gdb) p 0x7ef50644c690 - 0x7ef506412350
$3 = 238400
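At this point, $rsp is dangerously close to the low end of the stack (the stack grows toward lower addresses, so this is where it runs out): 0x7ef506412350 is only 0x350 (848) bytes above 0x7ef506412000, the start of the stack mapping. The step from frame #12 to frame #11 alone consumed 238400 bytes of the 262144-byte (0x40000) mapping. When qsort gets called, it doesn't have enough stack space to complete, which causes the SIGSEGV when the guard page is touched by msort_with_tmp.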
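The arithmetic behind that observation can be written out as a small C sketch. The macro and function names are hypothetical and chosen only for illustration; the addresses are the ones printed by gdb and readelf in this report.

```c
#include <assert.h>
#include <stdint.h>

/* Addresses taken from the gdb session and the readelf output in this report. */
#define RSP_FRAME_12 0x7ef50644c690ULL /* $rsp in frame #12 */
#define RSP_FRAME_11 0x7ef506412350ULL /* $rsp in frame #11 */
#define STACK_LOW    0x7ef506412000ULL /* VirtAddr of the RW stack mapping */
#define STACK_SIZE   0x40000ULL        /* MemSiz of that mapping */

/* Bytes of stack consumed between a caller's rsp and a callee's rsp.
   The stack grows toward lower addresses, so the caller's rsp is higher. */
static uint64_t stack_used_between(uint64_t caller_rsp, uint64_t callee_rsp)
{
    assert(caller_rsp >= callee_rsp);
    return caller_rsp - callee_rsp;
}

/* Bytes left between rsp and the lowest address of the stack mapping. */
static uint64_t stack_headroom(uint64_t rsp, uint64_t stack_low)
{
    assert(rsp >= stack_low);
    return rsp - stack_low;
}
```

Here stack_used_between(RSP_FRAME_12, RSP_FRAME_11) reproduces the 238400 that gdb printed, and stack_headroom(RSP_FRAME_11, STACK_LOW) is 848 bytes, which is all that remained for qsort_r and its recursive msort_with_tmp calls.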
There are two problems here.
1) The value in pins->stack_
2) _lf_pinbox_
Related branches
- Percona core: Pending requested
- Diff: 68 lines (+37/-3), 2 files modified
  mysys/lf_alloc-pin.c (+10/-1)
  mysys/my_thr_init.c (+27/-2)
- Laurynas Biveinis (community): Approve
- Registry Administrators: Pending requested
- Diff: 69 lines (+38/-3), 2 files modified
  mysys/lf_alloc-pin.c (+10/-1)
  mysys/my_thr_init.c (+28/-2)
- Laurynas Biveinis (community): Approve
- Registry Administrators: Pending requested
- Diff: 69 lines (+38/-3), 2 files modified
  mysys/lf_alloc-pin.c (+10/-1)
  mysys/my_thr_init.c (+28/-2)
Changed in percona-server:
assignee: nobody → Vlad Lesin (vlad-lesin)
tags: added: upstream
The upstream bug report is here: http://bugs.mysql.com/bug.php?id=73979