tripleo-ci-centos-7-undercloud-containers timing out in host configuration for step 2

Bug #1813900 reported by Rabi Mishra
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Cédric Jeanneret
Tags: ci
Rabi Mishra (rabi)
summary: - tripleo-ci-centos-7-undercloud-containers timing out for host configuration for step 2
+ tripleo-ci-centos-7-undercloud-containers timing out in host configuration for step 2
Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

https://github.com/openstack/puppet-tripleo/commit/f25c27aa2c6eff327d612d163c1758b59618d6ed just showed the real issue.

The issue is the lack of a tag in https://review.openstack.org/#/c/631784/

If we check the code in puppet-tripleo[1], we can see the dependency chain uses a specific tag for the firewall rules, the tag value being "tripleo-firewall-rule".
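
For reference, a minimal sketch of that ordering as expressed in puppet-tripleo (simplified from the manifest linked in [1]; class and tag names are taken from there):

# Every Firewall resource carrying the tag is applied after the "pre"
# rules and before the "post" (DROP) rules.
Class['tripleo::firewall::pre']
-> Firewall <| tag == 'tripleo-firewall-rule' |>
-> Class['tripleo::firewall::post']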

We can see that behavior with the following "chain" of rules:
2019-01-30 00:54:40.085 18172 WARNING tripleoclient.v1.tripleo_deploy.Deploy [ ] "Notice: /Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[000 accept related established rules]/Firewall[000 accept related established rules ipv4]/ensure: created",
2019-01-30 00:54:40.085 18172 WARNING tripleoclient.v1.tripleo_deploy.Deploy [ ] "Notice: /Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[000 accept related established rules]/Firewall[000 accept related established rules ipv6]/ensure: created",
2019-01-30 00:54:40.086 18172 WARNING tripleoclient.v1.tripleo_deploy.Deploy [ ] "Notice: /Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[001 accept all icmp]/Firewall[001 accept all icmp ipv4]/ensure: created",
2019-01-30 00:54:40.086 18172 WARNING tripleoclient.v1.tripleo_deploy.Deploy [ ] "Notice: /Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[001 accept all icmp]/Firewall[001 accept all icmp ipv6]/ensure: created",
2019-01-30 00:54:40.087 18172 WARNING tripleoclient.v1.tripleo_deploy.Deploy [ ] "Notice: /Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[002 accept all to lo interface]/Firewall[002 accept all to lo interface ipv4]/ensure: created",
2019-01-30 00:54:40.087 18172 WARNING tripleoclient.v1.tripleo_deploy.Deploy [ ] "Notice: /Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[002 accept all to lo interface]/Firewall[002 accept all to lo interface ipv6]/ensure: created",
2019-01-30 00:54:40.088 18172 WARNING tripleoclient.v1.tripleo_deploy.Deploy [ ] "Notice: /Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[004 accept ipv6 dhcpv6]/Firewall[004 accept ipv6 dhcpv6 ipv6]/ensure: created",

Note the missing "003" ruleset - there are usually two rules in there, one allowing SSH from the ctlplane network (for the overcloud), and the other allowing SSH from everywhere (for the undercloud). The latter depends on a variable, as explained in the commit message of this patch: https://review.openstack.org/#/c/631784/
The default value of that variable is "true".
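
As a hypothetical sketch of what that "003" ruleset boils down to (the boolean $allow_ssh_from_all, the subnet variable and the exact parameters are illustrative, not the real names):

# SSH from the ctlplane network only (overcloud case)
tripleo::firewall::rule { '003 accept ssh from ctlplane':
  port   => 22,
  proto  => 'tcp',
  source => $ctlplane_cidr,  # illustrative
}

# SSH from everywhere (undercloud case), gated on a boolean
# defaulting to true
if $allow_ssh_from_all {
  tripleo::firewall::rule { '003 accept ssh from all':
    port  => 22,
    proto => 'tcp',
  }
}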

[1] https://github.com/openstack/puppet-tripleo/blob/master/manifests/firewall.pp#L133-L135

Changed in tripleo:
assignee: nobody → Cédric Jeanneret (cjeanner)
status: New → Triaged
Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Some more digging shows the following:
- the tag is added directly within tripleo::firewall::rule (see the sketch at the end of this comment), so my patch isn't of any use

- the 003 rule is being added later:
2019-01-29 21:48:11 | "Notice: /Stage[main]/Tripleo::Firewall/Tripleo::Firewall::Service_rules[sshd]/Tripleo::Firewall::Rule[003 accept ssh from all]/Firewall[003 accept ssh from all ipv4]/ensure: created",
2019-01-29 21:48:11 | "Notice: /Stage[main]/Tripleo::Firewall/Tripleo::Firewall::Service_rules[sshd]/Tripleo::Firewall::Rule[003 accept ssh from all]/Firewall[003 accept ssh from all ipv6]/ensure: created",

- this addition is done before the "DROP" rules

- we can actually see them in the right place in the network logs:
http://logs.openstack.org/98/604298/204/check/tripleo-ci-centos-7-undercloud-containers/5a92241/logs/undercloud/var/log/extra/network.txt.gz

This means the issue is probably not related to the firewall.
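
For the record, a simplified sketch of why the extra tagging patch was useless - the define already tags every firewall resource it creates (parameters trimmed for brevity):

define tripleo::firewall::rule (
  $port  = undef,
  $proto = 'tcp',
) {
  # Each realized rule carries the tag, so the ordering collector in
  # firewall.pp picks it up automatically; no additional tagging is
  # needed on the caller side.
  firewall { "${title} ipv4":
    dport => $port,
    proto => $proto,
    tag   => 'tripleo-firewall-rule',
  }
}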

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Cédric Jeanneret (<email address hidden>) on branch: master
Review: https://review.openstack.org/633897
Reason: tag is set within tripleo::firewall::rule directly.

Revision history for this message
Rabi Mishra (rabi) wrote :

Some other errors in the journal log seem suspect around the time of the issue. Though they look haproxy-related, could this still be some kind of network connectivity issue?

Jan 30 00:59:46 undercloud.localdomain haproxy[55424]: Server barbican/undercloud.ctlplane.localdomain is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jan 30 00:59:46 undercloud.localdomain haproxy[55424]: proxy barbican has no server available!
Jan 30 00:59:46 undercloud.localdomain haproxy[55424]: Server glance_api/undercloud.ctlplane.localdomain is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jan 30 00:59:46 undercloud.localdomain haproxy[55424]: proxy glance_api has no server available!
Jan 30 00:59:46 undercloud.localdomain haproxy[55424]: Server heat_api/undercloud.ctlplane.localdomain is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jan 30 00:59:46 undercloud.localdomain haproxy[55424]: proxy heat_api has no server available!
Jan 30 00:59:46 undercloud.localdomain haproxy[55424]: Server heat_cfn/undercloud.ctlplane.localdomain is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jan 30 00:59:46 undercloud.localdomain haproxy[55424]: proxy heat_cfn has no server available!
Jan 30 00:59:46 undercloud.localdomain haproxy[55424]: Server ironic/undercloud.ctlplane.localdomain is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jan 30 00:59:46 undercloud.localdomain haproxy[55424]: proxy ironic has no server available!
Jan 30 00:59:46 undercloud.localdomain haproxy[55424]: Server ironic-inspector/undercloud.ctlplane.localdomain is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jan 30 00:59:46 undercloud.localdomain haproxy[55424]: proxy ironic-inspector has no server available!
Jan 30 00:59:46 undercloud.localdomain haproxy[55424]: Server keystone_admin/undercloud.ctlplane.localdomain is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jan 30 00:59:46 undercloud.localdomain haproxy[55424]: proxy keystone_admin has no server available!
Jan 30 00:59:46 undercloud.localdomain haproxy[55424]: Server keystone_public/undercloud.ctlplane.localdomain is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jan 30 00:59:46 undercloud.localdomain haproxy[55424]: pro...


Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

HAProxy configuration is pushed and activated before the backend containers are actually started; we can see exactly the same messages in "working" deploys. Service containers are started at step 3, IIRC.

Revision history for this message
Rabi Mishra (rabi) wrote :

Though I could not reproduce it with the reproducer, I could see the following in the journal log [1]:

Jan 30 09:09:14 undercloud.localdomain podman[53605]: haproxy-systemd-wrapper: SIGTERM -> 13.
Jan 30 09:09:14 undercloud.localdomain podman[53605]: haproxy-systemd-wrapper: exit, haproxy RC=0

Which seems weird.

[1] http://logs.openstack.org/98/604298/204/check/tripleo-ci-centos-7-undercloud-containers/a3648dd/logs/undercloud/var/log/journal.txt.gz#_Jan_30_09_09_14

Changed in tripleo:
milestone: none → stein-3
tags: added: alert ci
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

> Jan 30 09:09:14 undercloud.localdomain podman[53605]: haproxy-systemd-wrapper: SIGTERM -> 13.
> Jan 30 09:09:14 undercloud.localdomain podman[53605]: haproxy-systemd-wrapper: exit, haproxy RC=0

Is it http://git.openstack.org/cgit/openstack/puppet-tripleo/tree/files/certmonger-haproxy-refresh.sh#n46 that does that, via a "SIGHUP" sent to the systemd wrapper?

Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Not sure the haproxy sigterm is an issue.
I'm pushing this change[1] in order to get the ip6tables state, as it *might* be the issue, since IPv6 is usually preferred over IPv4 for connections, e.g. ansible to localhost.
But even if ip6tables isn't OK (and I'm pretty sure it's OK), we should get a connection error from ansible.

IIRC, step 2 is still host configuration, meaning "no container involved", "only puppet deploy on the host"... so that's pretty weird.

[1] https://review.openstack.org/#/c/633922/

Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Apparently patch https://review.openstack.org/#/c/633944/ solved the issue. Kudos to jaosorior :)

tags: removed: alert
Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Gate's all happy now.

Changed in tripleo:
status: In Progress → Fix Released