In rare cases L3 wasn't rescheduled after destroying controller

Bug #1545756 reported by Andrey Sledzinskiy
This bug affects 1 person
Affects             Status        Importance  Assigned to      Milestone
Mirantis OpenStack  Fix Released  High        MOS Maintenance
  8.0.x             Invalid       High        MOS Maintenance
  9.x               Fix Released  High        Oleg Bondarev

Bug Description

8.0 iso - 566

Scenario:
1. Deploy the following cluster: Neutron VXLAN, all other settings default, 3 controller, 2 compute and 1 cinder node
2. Create an instance with a key pair
3. Manually reschedule the router from the primary controller to another one (from node-3 to node-4; a scripted variant is sketched after the output below)
4. Destroy the controller hosting the l3-agent (node-4)
5. Wait until all HA OSTF tests pass
6. Check that the router was rescheduled from the dead l3-agent

Actual result: the router is still hosted by the dead agent:

 neutron l3-agent-list-hosting-router 67d98ea2-7332-46ce-84ff-61a693f135b1
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 75362c69-b5f2-420b-bece-e952ea7a0b92 | node-4.test.domain.local | True           | xxx   |          |
+--------------------------------------+--------------------------+----------------+-------+----------+
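
For steps 3 and 6, the reschedule and the check can also be scripted. A hedged sketch using python-neutronclient's v2.0 API; the endpoint, credentials and agent IDs below are illustrative placeholders, not values from this report:

    from neutronclient.v2_0 import client

    # Illustrative credentials/endpoint for an admin user.
    neutron = client.Client(username='admin', password='admin',
                            tenant_name='admin',
                            auth_url='http://192.168.0.2:5000/v2.0')

    router_id = '67d98ea2-7332-46ce-84ff-61a693f135b1'
    old_agent_id = 'AGENT-ID-ON-NODE-3'  # placeholder
    new_agent_id = 'AGENT-ID-ON-NODE-4'  # placeholder

    # Step 3: move the router between l3-agents by hand.
    neutron.remove_router_from_l3_agent(old_agent_id, router_id)
    neutron.add_router_to_l3_agent(new_agent_id, {'router_id': router_id})

    # Step 6: list the agents currently hosting the router
    # (the same data as the CLI output above).
    for agent in neutron.list_l3_agent_hosting_routers(router_id)['agents']:
        print(agent['host'], agent['alive'])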

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Changed in fuel:
assignee: nobody → MOS Neutron (mos-neutron)
Changed in fuel:
assignee: MOS Neutron (mos-neutron) → Oleg Bondarev (obondarev)
Changed in fuel:
status: New → Confirmed
tags: added: area-neutron
Revision history for this message
Oleg Bondarev (obondarev) wrote :

For some reason the looping "rescheduling" task in the running neutron server stops working. It should check for down bindings and reschedule routers away from down agents. Restarting the neutron server fixes the issue and the router is rescheduled, which means the problem is not in the DB logic that identifies dead bindings but in the looping task itself, which stops working or hangs. Needs further investigation.
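
For reference, a minimal sketch (not the actual neutron code) of how such a task is driven by oslo.service. Note that FixedIntervalLoopingCall.start() defaults to stop_on_exception=True, so an exception escaping the callback ends the loop for good, matching the behavior seen here:

    from oslo_service import loopingcall

    def reschedule_routers_from_down_agents():
        # Placeholder for the real logic: find bindings whose agent
        # heartbeat is older than the cutoff and reschedule the routers.
        print("checking for down l3-agent bindings...")

    loop = loopingcall.FixedIntervalLoopingCall(
        reschedule_routers_from_down_agents)
    # The interval value is illustrative; neutron derives it from config.
    loop.start(interval=30)
    # Block the caller; in neutron this runs inside the server process.
    loop.wait()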

Revision history for this message
Oleg Bondarev (obondarev) wrote :

Workaround - restart neutron server on any controller.

Revision history for this message
Oleg Bondarev (obondarev) wrote :

There is no information in the logs that would help debugging. We need another reproduction so we can inspect the greenthreads of the running neutron server and see what happens to the rescheduling looping task. Moving to Incomplete for now.
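
One way to inspect them, sketched here under the assumption of an eventlet-based process like neutron-server (e.g. run from an attached debug console), is to walk the garbage collector for live greenlets and print their stacks:

    import gc
    import traceback

    import greenlet

    def dump_greenthreads():
        # Print the current stack of every live greenthread; this shows
        # whether the rescheduling loop is still running, blocked, or gone.
        for obj in gc.get_objects():
            if isinstance(obj, greenlet.greenlet) and obj.gr_frame:
                print('--- greenthread %r ---' % obj)
                traceback.print_stack(obj.gr_frame)

    dump_greenthreads()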

Changed in fuel:
status: Confirmed → Incomplete
assignee: Oleg Bondarev (obondarev) → Andrey Sledzinskiy (asledzinskiy)
Revision history for this message
Oleg Bondarev (obondarev) wrote :

I was wrong; the logs have everything needed to identify the issue:

2016-02-15T10:44:44.259995+00:00 err: 2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall [req-79bce4c3-2e81-446c-8b37-6d30e3a964e2 - - - - -] Fixed interval looping call 'neutron.services.l3_router.l3_router_plugin.L3RouterPlugin.reschedule_routers_from_down_agents' failed
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall Traceback (most recent call last):
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 113, in _run_loop
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall result = func(*self.args, **self.kw)
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/dist-packages/neutron/db/l3_agentschedulers_db.py", line 101, in reschedule_routers_from_down_agents
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall down_bindings = self._get_down_bindings(context, cutoff)
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/dist-packages/neutron/db/l3_dvrscheduler_db.py", line 460, in _get_down_bindings
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall context, cutoff)
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/dist-packages/neutron/db/l3_agentschedulers_db.py", line 149, in _get_down_bindings
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall return query.all()
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2399, in all
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall return list(self)
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2516, in __iter__
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall return self._execute_and_instances(context)
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2529, in _execute_and_instances
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall close_with_result=True)
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2520, in _connection_from_session
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall **kw)
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 882, in connection
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall execution_options=execution_options)
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 889, in _connection_for_bind
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall conn = engine.contextual_connect(**kw)
2016-02-15 10:44:44.250 15419 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 2039, in contextual_connect
2016-02-15 10:44:44...

Changed in fuel:
status: Incomplete → Confirmed
assignee: Andrey Sledzinskiy (asledzinskiy) → Oleg Bondarev (obondarev)
Revision history for this message
Oleg Bondarev (obondarev) wrote :

So the issue happens because of a DB failure that is not handled on the neutron side: the rescheduling task just exits. This will be fixed in neutron. My concern, however, is that this did not happen before, which might mean that DB failures while shutting down one of the controllers have only started happening recently.
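
The fix later merged upstream (eb8ddb9, "Move db query to fetch down bindings under try/except" in the merge list below) takes the obvious shape: catch DB errors inside each iteration so a transient failure can no longer kill the looping call. A self-contained sketch with illustrative names, not the exact neutron code:

    from oslo_db import exception as db_exc

    def reschedule_iteration(get_down_bindings, reschedule, context, cutoff):
        # One iteration of the rescheduling loop, hardened against DB errors.
        try:
            down_bindings = get_down_bindings(context, cutoff)
        except db_exc.DBError:
            # A transient DB failure (e.g. while a controller is shutting
            # down) is swallowed and retried on the next tick instead of
            # propagating out of the FixedIntervalLoopingCall.
            return
        for binding in down_bindings:
            reschedule(context, binding)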

Revision history for this message
Oleg Bondarev (obondarev) wrote :

Since there is a simple workaround and the bug is not 100% reproducible, moving it to 8.0-updates.

Revision history for this message
Oleg Bondarev (obondarev) wrote :

Upstream fix: https://review.openstack.org/280753 needs to be backported to 8.0-updates once it merges.

tags: added: move-to-mu
tags: added: release-notes
summary: - L3 wasn't rescheduled after destroying controller
+ In rare L3 wasn't rescheduled after destroying controller
summary: - In rare L3 wasn't rescheduled after destroying controller
+ In rare cases L3 wasn't rescheduled after destroying controller
tags: added: 8.0 release-notes-done
removed: release-notes
no longer affects: fuel
no longer affects: fuel/8.0.x
tags: added: wait-for-stable
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/neutron (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/19544
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: 9e90051f07e35d29848e7f679a02d0cd992d541a
Author: Jenkins <email address hidden>
Date: Wed Apr 13 13:22:17 2016

Merge the tip of origin/stable/mitaka into origin/9.0/mitaka

05a4a34 Notify resource_versions from agents only when needed
fff909e Values for [ml2]/physical_network_mtus should not be unique
6814411 Imported Translations from Zanata
a2d1c46 firewall: don't warn about a driver that does not accept bridge
fa5eb53 Add uselist=True to subnet rbac_entries relationship
c178bd9 Fix race conditions in IP availability API tests
ee32ea5 Switched from fixtures to mock to mock out starting RPC consumers
77696d8 Imported Translations from Zanata
3190494 Fix zuul_cloner errors during tox job setup
04fb147 Refactor and fix dummy process fixture
844cae4 Switches metering agent to stateless iptables
19ea6ba Remove obsolete keepalived PID files before start
aafa702 Add IPAllocation object to session info to stop GC
005d49d Ensure metadata agent doesn't use SSL for UNIX socket
905fd05 DVR: Increase the link-local address pair range
93d719a SG protocol validation to allow numbers or names
33d3b8c L3 agent: match format used by iptables
7b2fcaa Use right class method in IP availability tests
93cdf8e Make L3 HA interface creation concurrency safe
d934669 ovsfw: Remove vlan tag before injecting packets to port
33c01f4 Imported Translations from Zanata
05ac012 test_network_ip_availability: Skip IPv6 tests when configured so
38894cc Retry updating agents table in case of deadlock
aac460b Allow to use several nics for physnet with SR-IOV
90b9cd3 port security: gracefully handle resources with no bindings
7174bc4 Ignore exception when deleting linux bridge if doesn't exist
93d29d1 Don't delete br-int to br-tun patch on startup
211e0a6 functional: Update ref used from ovs branch-2.5.
c6ef57a ovs-fw: Mark conntrack entries invalid if no rule is matched
ef6ea62 l3: Send notify on router_create when ext gw is specified
eb8ddb9 Move db query to fetch down bindings under try/except
da1eee3 Close XenAPI sessions in neutron-rootwrap-xen-dom0
1d51172 Watch for 'new' events in ovsdb monitor for ofport
bd3e9c3 Removes host file contents from DHCP agent logs

Closes-Bug: #1569735
Closes-Bug: #1569738
Closes-Bug: #1539664
Closes-Bug: #1560727
Closes-Bug: #1558613
Closes-Bug: #1545756

Change-Id: Ia30076744f13666f950fee78a86a8c81f7207206

tags: added: on-verification
Revision history for this message
Ivan Berezovskiy (iberezovskiy) wrote :

Verified on:
[root@fuel ~]# shotgun2 short-report
cat /etc/fuel_build_id:
 364
cat /etc/fuel_build_number:
 364
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6345.noarch
 fuel-misc-9.0.0-1.mos8360.noarch
 network-checker-9.0.0-1.mos72.x86_64
 fuel-mirror-9.0.0-1.mos135.noarch
 fuel-openstack-metadata-9.0.0-1.mos8683.noarch
 fuel-notify-9.0.0-1.mos8360.noarch
 fuel-ostf-9.0.0-1.mos934.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8683.noarch
 python-fuelclient-9.0.0-1.mos315.noarch
 fuelmenu-9.0.0-1.mos270.noarch
 fuel-9.0.0-1.mos6345.noarch
 fuel-utils-9.0.0-1.mos8360.noarch
 fuel-nailgun-9.0.0-1.mos8683.noarch
 rubygem-astute-9.0.0-1.mos742.noarch
 fuel-library9.0-9.0.0-1.mos8360.noarch
 shotgun-9.0.0-1.mos88.noarch
 fuel-agent-9.0.0-1.mos277.noarch
 fuel-ui-9.0.0-1.mos2685.noarch
 fuel-setup-9.0.0-1.mos6345.noarch
 nailgun-mcagents-9.0.0-1.mos742.noarch
 python-packetary-9.0.0-1.mos135.noarch
 fuel-bootstrap-cli-9.0.0-1.mos277.noarch
 fuel-migrate-9.0.0-1.mos8360.noarch

We also have this case automated; results can be found in TestRail, for example:
https://mirantis.testrail.com/index.php?/runs/view/10973&group_by=cases:section_id&group_order=asc

tags: removed: on-verification
Revision history for this message
Sergii Rizvan (srizvan) wrote :

The fix was delivered to 8.0 by merging the tip of origin/stable/liberty into origin/openstack-ci/fuel-8.0/liberty. That's why I'm setting the 8.0 status to Invalid.
