[QA] Add a test that randomly stops/kills cluster services.

Bug #1645313 reported by Denis Klepikov
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Confirmed
Importance: Medium
Assigned to: Fuel QA Team

Bug Description

To improve cluster stability, please create a test that randomly kills or stops cluster services while the cluster is under load (see https://blueprints.launchpad.net/fuel/+spec/create-load-before-tests).

Which services should be stopped/killed: all OpenStack services, one service at a time.
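
A minimal sketch of the random, one-at-a-time selection, in Python. The service and node names are placeholders, and pick_targets is a hypothetical helper, not part of any existing Fuel QA framework:

```python
import random

# Placeholder inventory (assumption): a real test would discover
# the services and nodes from the deployed cluster.
SERVICES = ["nova-api", "neutron-server", "glance-api"]
NODES = ["node-1", "node-2", "node-3"]
ACTIONS = ["stop", "kill"]


def pick_targets(rounds, services=SERVICES, nodes=NODES, actions=ACTIONS, seed=None):
    """Yield one (action, service, node) tuple per round, chosen at random.

    Only one service is selected per round, so the caller disrupts
    a single service at a time.
    """
    rng = random.Random(seed)
    for _ in range(rounds):
        yield rng.choice(actions), rng.choice(services), rng.choice(nodes)
```

Passing a fixed seed would make a failing run reproducible, which may help when investigating a service that did not recover.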

What we should check during this test:
The stopped/killed service comes back online, if an automatic recovery procedure exists for it.
The cloud monitoring system reports the problem with a detailed explanation of which service failed on which cluster nodes. If an automatic recovery procedure exists, the cloud monitoring system should also report the recovery.

The test log should contain the following timestamps, each recorded together with the service and node it applies to: when the stop/kill command was sent, when the service stopped, when the cloud monitoring system reported the problem, and when the service recovered (if an automatic recovery procedure exists).

The time differences in seconds between points p1-p2, p2-p3, p3-p4 and p1-p3 should be logged too:
point 1 - the service was stopped
point 2 - the cloud monitoring system reported the problem
point 3 - the service recovered automatically (if an automatic recovery procedure exists) or was restarted manually (only for services without an automatic recovery procedure)
point 4 - the cloud monitoring system reported the service recovery

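For services with an automatic recovery procedure, the p1-p2 interval should be shorter than the p1-p3 interval (p1-p2 < p1-p3), i.e. the monitoring system should report the problem before the service recovers.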
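
The intervals and the p1-p2 < p1-p3 check could be computed roughly as follows. This is only a sketch: FailureEvent and its field names are hypothetical, and the timestamps are assumed to have been collected already by the test.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FailureEvent:
    """Timestamps for one stop/kill round (hypothetical structure)."""
    service: str
    node: str
    stopped: datetime            # point 1
    problem_reported: datetime   # point 2
    recovered: datetime          # point 3
    recovery_reported: datetime  # point 4


def intervals(ev):
    """Return the four time differences, in seconds, keyed by point pair."""
    def secs(start, end):
        return (end - start).total_seconds()

    return {
        "p1-p2": secs(ev.stopped, ev.problem_reported),
        "p2-p3": secs(ev.problem_reported, ev.recovered),
        "p3-p4": secs(ev.recovered, ev.recovery_reported),
        "p1-p3": secs(ev.stopped, ev.recovered),
    }


def reported_before_recovery(ev):
    """Check p1-p2 < p1-p3: the problem was reported before recovery."""
    d = intervals(ev)
    return d["p1-p2"] < d["p1-p3"]


if __name__ == "__main__":
    # Made-up timestamps, for illustration only.
    t0 = datetime(2016, 11, 28, 12, 0, 0)
    ev = FailureEvent("nova-api", "node-1", t0,
                      t0 + timedelta(seconds=5),
                      t0 + timedelta(seconds=12),
                      t0 + timedelta(seconds=20))
    print(ev.service, ev.node, intervals(ev), reported_before_recovery(ev))
```

A negative p2-p3 value would mean the service recovered before the monitoring system reported the problem, which is the case the p1-p2 < p1-p3 check is meant to catch.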

If a service has no automatic recovery procedure, the test should start it again only after the cloud monitoring system has reported a problem related to that service.
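
One possible shape for that manual-restart rule, as a sketch. The alert_reported and start_service callables are assumptions standing in for whatever the monitoring system and the test framework actually provide:

```python
import time


def restart_after_alert(alert_reported, start_service,
                        timeout=300.0, poll_interval=5.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Start the service only once monitoring has reported the problem.

    alert_reported: callable returning True when the alert is visible
                    (hypothetical hook into the monitoring system).
    start_service:  callable that starts the stopped service
                    (hypothetical hook into the test framework).
    Returns True if the service was started, False if no alert
    arrived within the timeout (the service is then left stopped).
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if alert_reported():
            start_service()
            return True
        sleep(poll_interval)
    return False
```

Returning False on timeout, instead of restarting anyway, keeps a missing alert visible as a test failure. The timeout and poll interval values here are arbitrary defaults.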

What is the benefit?
This test will help us check:
Do all services recover as expected?
Is each service's recovery time as expected?
What is the automatic recovery delay for each service?
Does the cloud monitoring system report issues in the cloud (which service on which node)?
What is the delay between the actual problem and its report (which service on which node)?
What is the delay between service recovery and its report (which service on which node)?

tags: added: support
description: updated
Changed in fuel:
milestone: none → 9.x-updates
importance: Undecided → Medium
status: New → Confirmed