charm should validate it can reach memcache

Bug #1827397 reported by James Troup
This bug affects 4 people
Affects: OpenStack Nova Cloud Controller Charm
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned

Bug Description

The 19.04 nova-cloud-controller charm introduced a dependency on memcache and, due to LP #1823740, it's super easy to end up in a situation where you have the memcache relation but n-c-c is not actually able to talk to memcached. Diagnosing this is non-trivial and, while it is documented in the release notes, wherever possible we should not rely on charm users reading release notes before upgrading. Apart from anything else, they're not exactly discoverable, e.g. I can find no mention of them on either:

  https://jujucharms.com/nova-cloud-controller/327
  https://github.com/openstack/charm-nova-cloud-controller

And even when you find them, I had to page down 11 times before I got to any mention of n-c-c and memcache.

Please make the charm check connectivity to memcache and do something useful (e.g. warn the user in juju status and/or set status to blocked).
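
A minimal sketch of the kind of check the charm could run, assuming the memcached host list comes from the memcache relation data and using status_set from charmhelpers; the helper names here are hypothetical, not part of the current charm:

    # Hypothetical sketch: verify each memcached unit on the relation is
    # reachable over TCP before reporting the unit as ready.
    import socket

    from charmhelpers.core.hookenv import status_set


    def unreachable_memcache_hosts(hosts, port=11211, timeout=3):
        """Return the memcached hosts we cannot open a TCP connection to."""
        bad = []
        for host in hosts:
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    pass
            except OSError:
                bad.append(host)
        return bad


    def assess_memcache_status(hosts):
        bad = unreachable_memcache_hosts(hosts)
        if bad:
            # Surface the problem directly in `juju status` rather than
            # relying on users reading the release notes.
            status_set('blocked',
                       'cannot reach memcached on: {}'.format(', '.join(bad)))
            return False
        return True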

Changed in charm-nova-cloud-controller:
status: New → Triaged
importance: Undecided → Wishlist
Alvaro Uria (aluria)
tags: added: canonical-bootstack
Alvaro Uria (aluria) wrote :

This "known issue" degraded a charm-upgraded Cloud considerably. Because of haproxy-*-timeout, "openstack service list" (and the same via Horizon) aborted the connection before finishing. And there was a point where no logs identified this as an issue.

In retrospect, grepping for "memcache" only shows the values after a nova-api-os-compute restart (in debug mode).

I would agree that the nova-cloud-controller message in "juju status" could highlight this connectivity issue, but I also think the Nova logs should state (at least in debug mode) that connections to the configured memcached servers are being refused (the logs may not tell us that the Juju bindings are the cause, but at least the problem won't be hidden).

Alvaro Uria (aluria) wrote :

Cloud is 19.04, xenial-queens.

tags: added: sts
Felipe Reyes (freyes) wrote :

About logging the connection failures: the python-memcached library does not use the logging library; it simply does a sys.stderr.write() [0] if debug=1 is passed when instantiating the client [1].

[0] https://github.com/linsomniac/python-memcached/blob/master/memcache.py#L403
[1] https://github.com/linsomniac/python-memcached/blob/master/memcache.py#L170
[2] https://github.com/linsomniac/python-memcached/blob/master/memcache.py#L1400-L1409
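
For reference, a small illustrative snippet of that behaviour (the server address is made up); with debug=1 the failures go to stderr only, which is why nothing shows up in the regular log files:

    # Illustrative only: python-memcached reports dead servers via
    # sys.stderr.write() rather than the logging module when debug=1.
    import memcache

    client = memcache.Client(['192.0.2.10:11211'], debug=1)
    client.set('probe-key', 'probe-value')  # a refused connection is written to stderr
    print(client.get('probe-key'))          # returns None when the server is unreachable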

Trent Lloyd (lathiat) wrote :

Did some research into this, specifically into why nova gets so hung up rather than just timing out. A few findings:

 - haproxy cuts the client connections after 30 seconds, terminating the connection to nova. (consider revisiting)

 - The memcached charm sets its firewall rules to 'drop' instead of 'reject', so anything that can't connect hangs until the timeout instead of failing fast. (will change)

 - There is a related earlier bug, https://bugs.launchpad.net/oslo.cache/+bug/1731921: memcached_socket_timeout was lowered from 3.0s to 1.0s in oslo.cache 1.37.0, which is only in Train+ and Eoan+. Unless you are on Train (which pretty much no one is), the timeout is still 3.0s, i.e. triple the total timeout; in that case only 10 cache operations are needed to reach the 30-second limit (see the worked numbers after this list).

  - Each operation seemingly incurs the socket timeout even if a server is marked dead (need to re-confirm that). Nova likely performs many such operations, although I haven't checked/tested how many. TBD. Tested using the app from https://github.com/4383/oslo.cache-labs .

 - Some people in the above bug suggest you can easily set this as low as 0.25 seconds which makes sense (consider revisiting)

 - memcache_pool (which we use) does keep track of dead servers, but each worker, and each of that worker's worker threads, maintains its own state for this, so as requests cycle through each thread the issue is re-discovered. This could also affect failover from one memcache server to another; it can take a while for every worker to notice the dead server.

 - There is another patch that needs further research: https://bugs.launchpad.net/oslo.cache/+bug/1812935. It's unclear whether this is correctly patched in all versions of Ubuntu and we need to look into it further (note the comment: 'It may feel a bit misleading to mark disco and stein as fix released at this point'). It could affect some clouds that hit this, depending on their package version; we need to test with/without that patch and/or check the affected clouds' versions.
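
A rough worked example of the timeout arithmetic from the third point above, assuming each cache operation against an unreachable server blocks for the full socket timeout and that haproxy cuts the client at 30 seconds:

    # Back-of-envelope numbers only; real behaviour depends on how many cache
    # operations nova performs per request and on dead-server tracking.
    HAPROXY_CLIENT_TIMEOUT = 30.0  # seconds

    for socket_timeout in (3.0, 1.0, 0.25):  # pre-1.37.0 default, 1.37.0 default, suggested floor
        ops = HAPROXY_CLIENT_TIMEOUT / socket_timeout
        print('timeout {:>4}s -> ~{:.0f} blocking cache operations before '
              'haproxy terminates the request'.format(socket_timeout, ops))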

Still looking into this further; I just wanted to drop some interim notes.
