Don't use scheduler to run actions

Bug #1630508 reported by Renat Akhmerov
Affects: mistral
Status: Fix Released
Importance: High
Assigned to: Renat Akhmerov
Milestone: ocata-1

Bug Description

Currently, when the Mistral engine needs to run an action, it does so via the Scheduler. It creates a delayed call so that the Scheduler picks it up in a different transaction and makes the actual call.

The intention was:
1) To have a transactional guarantee for scheduling actions. If something fails within a DB transaction after we have called (scheduled) an action, the corresponding delayed call is also rolled back and the action won't run.
2) To avoid the concurrency window that we would have if we ran actions after the main DB transaction (e.g. the one created by method on_action_complete()). If the Mistral engine crashed right after the DB transaction but before calling actions via RPC, we would lose the information about which calls have been made and which have not.
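
To illustrate point 1, here is a minimal sketch of the rollback guarantee, using an in-memory SQLite database rather than Mistral's real persistence layer. The table name delayed_calls, the helper functions, and the action names are hypothetical and exist only for this example; they are not Mistral's actual schema or API.

```python
import sqlite3


def schedule_action_in_tx(conn, action_name, fail_after_scheduling):
    """Record a delayed call inside one DB transaction (hypothetical helper).

    If anything fails after the row is inserted, the whole transaction,
    including the delayed call, is rolled back.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO delayed_calls (action) VALUES (?)", (action_name,)
            )
            if fail_after_scheduling:
                raise RuntimeError("simulated failure later in the transaction")
    except RuntimeError:
        pass  # the transaction has already been rolled back


def pending_calls(conn):
    """Return the actions a scheduler would still find in the table."""
    rows = conn.execute("SELECT action FROM delayed_calls ORDER BY id")
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE delayed_calls (id INTEGER PRIMARY KEY, action TEXT)")
conn.commit()

schedule_action_in_tx(conn, "std.echo", fail_after_scheduling=True)
schedule_action_in_tx(conn, "std.http", fail_after_scheduling=False)
print(pending_calls(conn))
```

Only the call from the transaction that committed remains, so a scheduler polling this table would never run the action whose transaction failed.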

In practice, the Scheduler works with delays. It has iterations that run every N seconds (1 second in the current version), which leads to a significant loss of performance. On large workflows with lots of transitions and joins, the difference in total workflow run time may be up to 30%. It is a big deal in some use cases.
Additionally, even though we assume that the Scheduler takes care of reliable delayed call processing (to avoid the situation described in #2), in fact it doesn't: if the Scheduler itself dies, we may still lose the information about what has been called and what has not.

Proposal: Change the Mistral engine so that it only collects the RPC calls that it needs to make while a DB transaction is open. It could collect them in some thread-local storage, for example. After the DB transaction is committed successfully, it should make the actual calls via RPC. That way we would eliminate the performance problem. In this case, we will have the concurrency window described in #2. To deal with the relatively rare situations when we have to recover from it, we'll need some other mechanism, which should involve human interaction. For example, we could have a maintenance mode in Mistral in which we could manually finish tasks stuck in the RUNNING state or re-run them. In any case, it's a bigger task which also includes other failure situations.
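
The proposal could be sketched roughly as follows. This is not the actual fix that was merged: the names defer_rpc and db_transaction, and the commit/rollback callables passed in, are assumptions made purely for illustration.

```python
import threading
from contextlib import contextmanager

# Per-thread storage for RPC calls collected during the current transaction.
_local = threading.local()


def defer_rpc(func, *args, **kwargs):
    """Remember an RPC call instead of making it right away (hypothetical)."""
    if not hasattr(_local, "calls"):
        _local.calls = []
    _local.calls.append((func, args, kwargs))


@contextmanager
def db_transaction(commit, rollback):
    """Run a block in a 'transaction'; dispatch deferred calls after commit."""
    _local.calls = []
    try:
        yield
        commit()
    except Exception:
        _local.calls = []  # nothing was committed, so nothing must be sent
        rollback()
        raise
    # The transaction is committed. A crash between here and the end of the
    # loop is the concurrency window described in #2.
    calls, _local.calls = _local.calls, []
    for func, args, kwargs in calls:
        func(*args, **kwargs)


sent = []  # stands in for the RPC layer
log = []   # stands in for the database


def run_action(name):
    sent.append(name)


with db_transaction(lambda: log.append("commit"), lambda: log.append("rollback")):
    defer_rpc(run_action, "task1")
    # No RPC call has been made yet at this point.

print(sent, log)
```

Because the calls are dispatched immediately after the commit, there is no wait for the next Scheduler iteration. If the block raises, the collected calls are discarded together with the rollback.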

Changed in mistral:
milestone: none → ocata-1
assignee: nobody → Renat Akhmerov (rakhmerov)
importance: Undecided → High
status: New → Confirmed
Changed in mistral:
status: Confirmed → Fix Released