[L3 HA] After banning active l3-agent all healthy agents are still standby

Bug #1524822 reported by Kristina Berezovskaia
This bug affects 2 people
Affects            | Status       | Importance | Assigned to           | Milestone
Mirantis OpenStack | Fix Released | Medium     | Kristina Berezovskaia |
9.x                | Fix Released | Medium     | Ann Taraday           |

Bug Description

After banning the active l3 agent, all other agents remained standby:
root@node-4:~# neutron l3-agent-list-hosting-router router_EW
+--------------------------------------+-------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+-------------------+----------------+-------+----------+
| c51334df-5b50-404f-882f-ea74e717b44b | node-4.domain.tld | True | :-) | standby |
| 2af16bd2-3d4b-4a28-95c3-0a0cf5e248bd | node-5.domain.tld | True | xxx | active |
| 456c304e-7262-4206-a52d-4b87e4f2262b | node-3.domain.tld | True | :-) | standby |
+--------------------------------------+-------------------+----------------+-------+----------+

/var/lib/neutron/ha_confs/d20aa7e4-c009-4ae1-a7e5-88f7c209822a/state shows backup for all agents
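
A minimal check sketch for the remaining controllers, assuming the standard ha_confs layout and the usual qrouter-<router_id> namespace naming (keepalived is normally launched with its per-router config from ha_confs, so the router ID appears on its command line):
root@node-4:~# cat /var/lib/neutron/ha_confs/d20aa7e4-c009-4ae1-a7e5-88f7c209822a/state
root@node-4:~# ps ax | grep [k]eepalived | grep d20aa7e4
root@node-4:~# ip netns exec qrouter-d20aa7e4-c009-4ae1-a7e5-88f7c209822a ip -4 addr show
If no node's qr-/qg- ports carry the router addresses, no keepalived instance has actually taken over as master, which would match the broken ping.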

Steps (a CLI sketch follows the list):
1) Create net1 with a subnet
2) Create net2 with a subnet
3) Create a router, set the gateway and add interfaces to both nets
4) Boot vm1 in net1 and associate a floating IP
5) Boot vm2 in net2
6) Start pinging vm1 from vm2 by its floating and internal IPs
7) Ban the active l3 agent
8) Wait some time
9) Check ping
Expected result: ping is available
Current result: ping isn't available
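
A minimal CLI sketch of these steps for a Kilo-era neutron/nova CLI; the names net1/net2/subnet1/subnet2/vm1/vm2, the values in angle brackets, and the pacemaker resource name p_neutron-l3-agent are assumptions, not taken from this environment:
neutron net-create net1
neutron subnet-create net1 10.0.1.0/24 --name subnet1
neutron net-create net2
neutron subnet-create net2 10.0.2.0/24 --name subnet2
neutron router-create router_EW
neutron router-gateway-set router_EW <external-net>
neutron router-interface-add router_EW subnet1
neutron router-interface-add router_EW subnet2
nova boot --image <image> --flavor <flavor> --nic net-id=<net1-id> vm1
nova boot --image <image> --flavor <flavor> --nic net-id=<net2-id> vm2
neutron floatingip-create <external-net>
neutron floatingip-associate <floatingip-id> <vm1-port-id>
ping <vm1-floating-ip>   (run from vm2; also ping vm1's internal address)
pcs resource ban p_neutron-l3-agent <node-hosting-the-active-l3-agent>
neutron l3-agent-list-hosting-router router_EW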

Found on:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  openstack_version: "2015.1.0-8.0"
  api: "1.0"
  build_number: "264"
  build_id: "264"
  fuel-nailgun_sha: "0e09dce510927f2cc490b898e5fe3f813bd791be"
  python-fuelclient_sha: "f033192b84263f0e699458a4274289a5198ae7e4"
  fuel-agent_sha: "660c6514caa8f5fcd482f1cc4008a6028243e009"
  fuel-nailgun-agent_sha: "a33a58d378c117c0f509b0e7badc6f0910364154"
  astute_sha: "48fd58676debcc85951db68df6d77c22daa55e52"
  fuel-library_sha: "ab7e51f345ffb7c256e0f61addcf86553d7c3867"
  fuel-ostf_sha: "23b7ae2a1a57de5a3e1861ffb7805394ca339cc2"
  fuel-mirror_sha: "6534117233a5bdc51d7d47361bc7d511e4b11e6f"
  fuelmenu_sha: "fcb15df4fd1a790b17dd78cf675c11c279040941"
  shotgun_sha: "a0bd06508067935f2ae9be2523ed0d1717b995ce"
  network-checker_sha: "a3534f8885246afb15609c54f91d3b23d599a5b1"
  fuel-upgrade_sha: "1e894e26d4e1423a9b0d66abd6a79505f4175ff6"
  fuelmain_sha: "26adf12c320936a97a9b0a84169a6e58c530e848"
(3 controllers, 2 compute, neutron+vxlan+l3 ha)

This problem is not always reproduced.
Attaching l3 agent and neutron-server logs from all controllers.

Revision history for this message
Kristina Berezovskaia (kkuznetsova) wrote :
Changed in mos:
status: New → Confirmed
Revision history for this message
Elena Ezhova (eezhova) wrote :

Seems this needs another repro and an env where the bug was reproduced.

Changed in mos:
assignee: MOS Neutron (mos-neutron) → Kristina Kuznetsova (kkuznetsova)
status: Confirmed → Incomplete
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

No longer fixing Medium bugs in 8.0. MOS Neutron team, please give it another try in 9.0

tags: added: area-neutron
removed: neutron
Revision history for this message
Kristina Berezovskaia (kkuznetsova) wrote :

Reproduced one more time on 8.0. In this case I destroyed the controller with the active l3 agent instead of banning the l3 agent.

root@node-18:~# neutron l3-agent-list-hosting-router adba5c35-f2f8-468e-8f45-4863232ec4f8
+--------------------------------------+--------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------+----------------+-------+----------+
| 2c3894f8-68c4-4865-a970-fc4f0ae50429 | node-17.domain.tld | True | xxx | active |
| 4eba6107-65bc-4616-8c00-29d427788504 | node-18.domain.tld | True | :-) | standby |
| e4e095cd-55c7-4007-8f7b-0c00008257d4 | node-16.domain.tld | True | :-) | standby |
+--------------------------------------+--------------------+----------------+-------+----------+

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "478"
  build_id: "478"
  fuel-nailgun_sha: "ae949905142507f2cb446071783731468f34a572"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "481ed135de2cb5060cac3795428625befdd1d814"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "420c6fa5f8cb51f3322d95113f783967bde9836e"
  fuel-ostf_sha: "ab5fd151fc6c1aa0b35bc2023631b1f4836ecd61"
  fuel-mirror_sha: "b62f3cce5321fd570c6589bc2684eab994c3f3f2"
  fuelmenu_sha: "fac143f4dfa75785758e72afbdc029693e94ff2b"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "6c6b088a3d52dd0eaf43d59f3a3a149c93a07e7e"
(l2+vxlan+l3)

Changed in mos:
status: Incomplete → Won't Fix
no longer affects: mos/8.0.x
Changed in mos:
milestone: 8.0 → 8.0-updates
Revision history for this message
Yury Tregubov (ytregubov) wrote :

The same symptoms are seen on 9.0 Mitaka builds. The l3 agent is stuck in the standby state even if all other l3 agents are banned:
root@node-1:~# neutron l3-agent-list-hosting-router b8da5b6c-3661-413f-b521-53d7347ddb61
+--------------------------------------+--------------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| f3c0db85-1fcf-44c6-8fdf-c83bf2c6761e | node-3.test.domain.local | True | xxx | standby |
| 5a94f772-6db7-4ac0-94a2-8e9fcbf20be4 | node-2.test.domain.local | True | xxx | standby |
| 24fa0b7d-39cc-46cd-84f7-b8b691436493 | node-1.test.domain.local | True | :-) | standby |
+--------------------------------------+--------------------------+----------------+-------+----------+

The problem is reproducible; already seen on several 9.0 Mitaka ISOs: 59, 79 and 89.

However, ping between the VMs connected to the affected router works fine.
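
Since ping still works, at least one keepalived instance is presumably a real master even though neutron reports standby for every agent, which points at the state reporting rather than at failover itself. A minimal cross-check sketch for each controller, using the router ID from the listing above and assuming the usual qrouter-<router_id> namespace naming:
root@node-1:~# cat /var/lib/neutron/ha_confs/b8da5b6c-3661-413f-b521-53d7347ddb61/state
root@node-1:~# ip netns exec qrouter-b8da5b6c-3661-413f-b521-53d7347ddb61 ip -4 addr show | grep -E 'qr-|qg-'
The node whose qr-/qg- ports still carry the router addresses is the de facto master, whatever the API reports.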

tags: added: keep-in-9.0
Revision history for this message
Ann Taraday (akamyshnikova) wrote :

The situation we see in MOS 9.0 (agents stuck in the standby state while connectivity works fine) is caused by the absence of a cleanup script, which is already on review at https://review.fuel-infra.org/#/c/18773/; as soon as it is merged this should be fixed.

Revision history for this message
Ann Taraday (akamyshnikova) wrote :
Revision history for this message
Kristina Berezovskaia (kkuznetsova) wrote :

Verified on 9.0
cat /etc/fuel_build_id:
 355
cat /etc/fuel_build_number:
 355
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6345.noarch
 fuel-bootstrap-cli-9.0.0-1.mos282.noarch
 fuel-migrate-9.0.0-1.mos8383.noarch
 rubygem-astute-9.0.0-1.mos745.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8704.noarch
 network-checker-9.0.0-1.mos72.x86_64
 fuel-mirror-9.0.0-1.mos136.noarch
 fuel-openstack-metadata-9.0.0-1.mos8704.noarch
 fuel-notify-9.0.0-1.mos8383.noarch
 nailgun-mcagents-9.0.0-1.mos745.noarch
 python-fuelclient-9.0.0-1.mos315.noarch
 fuelmenu-9.0.0-1.mos270.noarch
 fuel-9.0.0-1.mos6345.noarch
 fuel-utils-9.0.0-1.mos8383.noarch
 fuel-setup-9.0.0-1.mos6345.noarch
 fuel-library9.0-9.0.0-1.mos8383.noarch
 shotgun-9.0.0-1.mos88.noarch
 fuel-agent-9.0.0-1.mos282.noarch
 fuel-ui-9.0.0-1.mos2695.noarch
 fuel-ostf-9.0.0-1.mos934.noarch
 fuel-misc-9.0.0-1.mos8383.noarch
 python-packetary-9.0.0-1.mos136.noarch
 fuel-nailgun-9.0.0-1.mos8704.noarch
(vxlan+l2+l3, 3 controllers and 2 computes)

Steps:
1) Create net1 with a subnet
2) Create net2 with a subnet
3) Create a router, set the gateway and add interfaces to both nets
4) Boot vm1 in net1 and associate a floating IP
5) Boot vm2 in net2
6) Start pinging vm1 from vm2 by its floating and internal IPs
7) Ban the active l3 agent (a CLI sketch of the ban/clear cycle follows this list)
8) Wait some time
9) Check ping and that one of the other agents has moved to ACTIVE
10) Ban one more active agent
11) Check ping and that another agent has moved to ACTIVE
12) Clear the 2 banned agents
13) Repeat steps 7-12 several times
Ping is available and one agent is always in the ACTIVE state
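
A minimal sketch of the ban/clear cycle from steps 7-12; the pacemaker resource name p_neutron-l3-agent and the placeholder values in angle brackets are assumptions for this environment:
pcs resource ban p_neutron-l3-agent <node-with-the-active-l3-agent>
neutron l3-agent-list-hosting-router <router-id>   (wait until another agent reports active, then check ping)
pcs resource ban p_neutron-l3-agent <node-with-the-new-active-l3-agent>
neutron l3-agent-list-hosting-router <router-id>
pcs resource clear p_neutron-l3-agent <node-with-the-active-l3-agent>
pcs resource clear p_neutron-l3-agent <node-with-the-new-active-l3-agent>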
