charm should validate it can reach memcache

Bug #1827397 reported by James Troup
This bug affects 4 people
Affects: OpenStack Nova Cloud Controller Charm
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned

Bug Description

The 19.04 nova-cloud-controller charm introduced a dependency on memcache and, due to LP #1823740, it's super easy to end up in a situation where you have the memcache relation but n-c-c is not actually able to talk to memcached. Diagnosing this is non-trivial and, while it is documented in the release notes, wherever possible we should not rely on charm users reading release notes before upgrading. Apart from anything else, they're not exactly discoverable, e.g. I can find no mention of them on either:

  https://jujucharms.com/nova-cloud-controller/327
  https://github.com/openstack/charm-nova-cloud-controller

And even when you find them, I had to page down 11 times before I got to any mention of n-c-c and memcache.

Please make the charm check connectivity to memcache and do something useful (e.g. warn the user in juju status and/or set status to blocked).
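
A minimal sketch of the kind of check the charm could run, assuming the memcached host list comes from the memcache relation data and using status_set from charmhelpers; the helper names here are hypothetical, not part of the current charm:

    # Hypothetical sketch: verify each memcached unit on the relation is
    # reachable over TCP before reporting the unit as ready.
    import socket

    from charmhelpers.core.hookenv import status_set


    def unreachable_memcache_hosts(hosts, port=11211, timeout=3):
        """Return the memcached hosts we cannot open a TCP connection to."""
        bad = []
        for host in hosts:
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    pass
            except OSError:
                bad.append(host)
        return bad


    def assess_memcache_status(hosts):
        bad = unreachable_memcache_hosts(hosts)
        if bad:
            # Surface the problem directly in `juju status` rather than
            # relying on users reading the release notes.
            status_set('blocked',
                       'cannot reach memcached on: {}'.format(', '.join(bad)))
            return False
        return True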

Changed in charm-nova-cloud-controller:
status: New → Triaged
importance: Undecided → Wishlist
Alvaro Uria (aluria)
tags: added: canonical-bootstack
Alvaro Uria (aluria) wrote :

This "known issue" degraded a charm-upgraded Cloud considerably. Because of haproxy-*-timeout, "openstack service list" (and the same via Horizon) aborted the connection before finishing. And there was a point where no logs identified this as an issue.

In retrospect, grepping for "memcache" only shows the values after a nova-api-os-compute restart (in debug mode).

I would agree that the nova-cloud-controller message in "juju status" could highlight this connectivity issue, but I also think the Nova logs should state (at least in debug mode) that connections to the configured memcached servers are being refused (the logs may not tell us that the Juju bindings are the cause, but at least the problem won't be hidden).

Alvaro Uria (aluria) wrote :

Cloud is 19.04, xenial-queens.

tags: added: sts
Felipe Reyes (freyes) wrote :

About logging the connection failures: the python-memcached library does not use the logging library; it simply does a sys.stderr.write() [0] if debug=1 is passed when instantiating the client [1].

[0] https://github.com/linsomniac/python-memcached/blob/master/memcache.py#L403
[1] https://github.com/linsomniac/python-memcached/blob/master/memcache.py#L170
[2] https://github.com/linsomniac/python-memcached/blob/master/memcache.py#L1400-L1409
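
For reference, a small illustrative snippet of that behaviour (the server address is made up); with debug=1 the failures go to stderr only, which is why nothing shows up in the regular log files:

    # Illustrative only: python-memcached reports dead servers via
    # sys.stderr.write() rather than the logging module when debug=1.
    import memcache

    client = memcache.Client(['192.0.2.10:11211'], debug=1)
    client.set('probe-key', 'probe-value')  # a refused connection is written to stderr
    print(client.get('probe-key'))          # returns None when the server is unreachable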

Trent Lloyd (lathiat) wrote :

Did some research into this, specifically into why nova gets so hung up rather than just timing out. A few findings:

 - haproxy cuts the client connections after 30 seconds, terminating the connection to nova. (consider revisiting)

 - The memcached charm sets its firewall rules to 'drop' instead of 'reject', so anything that can't connect hangs until the timeout instead of failing fast. (will change)

 - There is a related earlier bug, https://bugs.launchpad.net/oslo.cache/+bug/1731921: memcached_socket_timeout was lowered from 3.0s to 1.0s in oslo.cache 1.37.0, which is only in Train+ and Eoan+. Unless you are on Train (which pretty much no one is), the timeout is still 3.0s, i.e. triple the total timeout; in that case only 10 cache operations are needed to reach the 30-second limit (see the worked numbers after this list).

  - Each operation seemingly incurs the socket timeout even if a server is marked dead (need to re-confirm that). Nova likely performs many such operations, although I haven't checked/tested how many. TBD. Tested using the app from https://github.com/4383/oslo.cache-labs .

 - Some people in the above bug suggest you can easily set this as low as 0.25 seconds which makes sense (consider revisiting)

 - memcache_pool (which we use) does keep track of dead servers, but each worker, and each of that worker's worker threads, maintains its own state for this, so as requests cycle through each thread the issue is re-discovered. This could also affect failover from one memcache server to another; it can take a while for every worker to notice the dead server.

 - There is another patch that needs further research: https://bugs.launchpad.net/oslo.cache/+bug/1812935. It's unclear whether this is correctly patched in all versions of Ubuntu and we need to look into it further (note the comment: 'It may feel a bit misleading to mark disco and stein as fix released at this point'). It could affect some clouds that hit this, depending on their package version; we need to test with/without that patch and/or check the affected clouds' versions.
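
A rough worked example of the timeout arithmetic from the third point above, assuming each cache operation against an unreachable server blocks for the full socket timeout and that haproxy cuts the client at 30 seconds:

    # Back-of-envelope numbers only; real behaviour depends on how many cache
    # operations nova performs per request and on dead-server tracking.
    HAPROXY_CLIENT_TIMEOUT = 30.0  # seconds

    for socket_timeout in (3.0, 1.0, 0.25):  # pre-1.37.0 default, 1.37.0 default, suggested floor
        ops = HAPROXY_CLIENT_TIMEOUT / socket_timeout
        print('timeout {:>4}s -> ~{:.0f} blocking cache operations before '
              'haproxy terminates the request'.format(socket_timeout, ops))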

Still looking into this further; I just wanted to drop some interim notes.
