Connections to DB are refusing to die after VIP is switched
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| kolla-ansible | Fix Released | Medium | Michal Arbet | |
| Train | New | Medium | Unassigned | |
| Ussuri | Fix Committed | Medium | Unassigned | |
| Victoria | Fix Committed | Medium | Unassigned | |
| Wallaby | Fix Committed | Medium | Michal Arbet | |
Bug Description
Hi,
On a production kolla-ansible environment we found strange behaviour when switching the VIP between controllers under load.
When the VIP is switched from the master keepalived to the backup, connections to the DB are dead on the host where the VIP was before the switch (all keystone wsgi workers are busy, waiting for a DB reply).
Test env:
- 2 Controllers - Haproxy, keepalived, OS services, DB ..etc
- 2 Computes
How to reproduce:
1. Generate as big traffic as you can to replicate issue (curl token issue to keystone VIP:5000)
2. Check logs for keystone (there will be big amount of 201 on both controllers)
2. Restart keepalived OR restart networking OR ifup/ifdown interface on current keepalived master
(VIP will be switched to secondary host)
3. Check logs for keystone
4. You can see that access log for keystone is freezed (on host where VIP was before), after while there will be 503,504
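The load generation in step 1 can be sketched with a small shell script. This is only a sketch: the VIP address, user name, domain, and password below are placeholders, not values from the bug report, and must be replaced with your environment's own.

```shell
#!/bin/sh
# Placeholder values -- substitute your environment's VIP and credentials.
KEYSTONE_VIP="${KEYSTONE_VIP:-192.0.2.10}"
OS_USER="${OS_USER:-admin}"
OS_PASS="${OS_PASS:-secret}"

# Keystone v3 password-auth body for a token issue (POST /v3/auth/tokens).
token_request_body() {
  cat <<EOF
{"auth": {"identity": {"methods": ["password"],
  "password": {"user": {"name": "${OS_USER}",
    "domain": {"name": "Default"}, "password": "${OS_PASS}"}}}}}
EOF
}

# Fire N concurrent token-issue requests at the VIP; a healthy keystone
# answers each of them with HTTP 201.
generate_load() {
  n="${1:-200}"
  i=0
  while [ "$i" -lt "$n" ]; do
    curl -s -o /dev/null -w '%{http_code}\n' \
      -H 'Content-Type: application/json' \
      -d "$(token_request_body)" \
      "http://${KEYSTONE_VIP}:5000/v3/auth/tokens" &
    i=$((i + 1))
  done
  wait
}

# Usage (not run here): generate_load 1000
```

Backgrounding each curl keeps many requests in flight at once, which is what keeps the wsgi workers and their DB connection pool busy at the moment of the VIP switch.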
Why is this happening?
Normally, when the master keepalived is not reachable, the secondary keepalived takes over the VIP and sends a GARP to the network, so all clients refresh their ARP tables and everything should keep working.
The problem is that the wsgi processes hold a connection pool to the DB, and these connections are dead after the switch: they don't know that the ARP mapping changed (probably the host ignored the GARP because there is a very tiny window during which the VIP was still assigned to it).
So the wsgi processes keep trying to write to a dead file descriptor.
The problem above resolves itself after some time; how long depends on the kernel option net.ipv4.tcp_retries2.
Decreasing tcp_retries2 to 1 fixed the issue immediately.
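How long "some time" is can be estimated with a simplified model of the kernel's exponential retransmission backoff. This is a sketch, not kernel code: it assumes the RTO starts at TCP_RTO_MIN (200 ms) and doubles up to TCP_RTO_MAX (120 s), ignoring the RTT-based RTO estimation a live connection would actually use.

```shell
#!/bin/sh
# Rough model: a write on a dead connection is retransmitted tcp_retries2
# times, waiting one RTO before giving up on each attempt; the RTO doubles
# from 200 ms (TCP_RTO_MIN) and is capped at 120 s (TCP_RTO_MAX).
estimate_timeout_ms() {
  retries="$1"
  rto=200
  total=0
  i=0
  while [ "$i" -le "$retries" ]; do
    total=$((total + rto))
    rto=$((rto * 2))
    [ "$rto" -gt 120000 ] && rto=120000
    i=$((i + 1))
  done
  echo "$total"
}

estimate_timeout_ms 15   # kernel default -> 924600 ms, ~15.4 minutes
estimate_timeout_ms 3    # HA tuning     -> 3000 ms
```

With the default of 15 the model lands on roughly 15 minutes before the socket is declared dead, which is why the stuck wsgi workers eventually recover on their own, just far too late for clients.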
Here is a detailed article about TCP sockets which refuse to die -> https:/
RedHat also suggests tuning this kernel option for HA solutions, as noted here -> https:/
"In a High Availability (HA) situation consider decreasing the setting to 3." (from RedHat)
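A sketch of that tuning with sysctl (the drop-in file name below is an assumption, distributions differ; also note that a deployment tool such as kolla-ansible may manage sysctl settings itself):

```shell
# Inspect the current limit (15 on most stock kernels).
sysctl net.ipv4.tcp_retries2

# Lower it so connections to the old VIP holder fail fast after a
# failover; RedHat suggests 3 for HA setups. Requires root.
sysctl -w net.ipv4.tcp_retries2=3

# Persist across reboots (file name is an assumption; adjust per distro).
echo 'net.ipv4.tcp_retries2 = 3' > /etc/sysctl.d/90-ha-tcp.conf
```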
Here is also a video of the issue (left: controller0, right: controller1, bottom: logs, middle: VIP switch monitor)
https:/
I will provide a fix and push it for review.
Changed in kolla-ansible:
  status: New → In Progress
  assignee: nobody → Michal Arbet (michalarbet)
Changed in kolla-ansible:
  importance: Undecided → Medium
+ https://pracucci.com/linux-tcp-rto-min-max-and-tcp-retries2.html