Quiesce – Stop/starting in a cloud native world

Whether during an upgrade or for any other reason, there will be times when a service provider needs to stop or restart a running process, or to shut down or reboot running nodes. But how can this be done without any interruption to service?

In a naïve implementation, a single node might contain state required for the duration of a call (dialog state). This will cause a problem. Shutting down the node will cause the state to be lost and the call to fail, for example during a SIP re-INVITE. Waiting for all calls to end could take a very long time and significantly delay the required restart.

Project Clearwater Sprout nodes do not store any dialog or other long-lived state locally, which is a significant advantage. Only short-lived state, held for the duration of a single SIP transaction, is stored. For more information on transaction state and dialog state, visit http://www.projectclearwater.org/technical/call-flows/. Losing a handful of ongoing transactions during an occasional node failure is consistent with standard fault tolerance techniques. During routine upgrades and config changes, however, this is not acceptable, and we must ensure absolutely no interruption to service.

Continuation of service is achieved with a period of quiescing prior to a full stop. During this period, the node rejects any new transactions (which are handled instead by other nodes in the cluster), but continues to process all existing transactions. Once existing transactions have terminated, we are in a position to stop the sprout process and/or restart the node without any loss.
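The quiesce-then-stop sequence can be illustrated with a minimal Python sketch. This is not Sprout's actual implementation. The class `QuiescingNode` and all of its method names are hypothetical, chosen only to show the idea of refusing new transactions while letting existing ones drain.

```python
import threading


class QuiescingNode:
    """Hypothetical transaction tracker illustrating quiesce-then-stop."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0                    # transactions currently in progress
        self._quiescing = False
        self._drained = threading.Event()   # set once it is safe to stop

    def try_begin_transaction(self):
        """Return True if a new transaction is accepted, False if refused."""
        with self._lock:
            if self._quiescing:
                # The caller should reject the request so that another
                # node in the cluster handles it instead.
                return False
            self._active += 1
            return True

    def end_transaction(self):
        """Record that an existing transaction has terminated."""
        with self._lock:
            self._active -= 1
            if self._quiescing and self._active == 0:
                self._drained.set()

    def quiesce(self):
        """Stop accepting new transactions; existing ones keep running."""
        with self._lock:
            self._quiescing = True
            if self._active == 0:
                self._drained.set()

    def wait_until_drained(self, timeout=None):
        """Block until all transactions have ended; True means safe to stop."""
        return self._drained.wait(timeout)
```

An operator script would call `quiesce()`, then `wait_until_drained()`, and only then stop the process or restart the node. Because only transaction state is held, the wait is bounded by the lifetime of a transaction rather than the lifetime of a call.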

In a cloud native environment, it might be necessary to expand on this behaviour. For example, the process must continue to respond to a monitoring system while quiescing. Otherwise the monitor could conclude that the process has unexpectedly died and forcibly terminate a still-running node. In addition, requests from a TAS that carry an ODI token parameter can only be handled by the local sprout, so they must also be allowed while quiescing.
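These exceptions amount to an admission policy that is applied while quiescing. The sketch below is one way to express it, assuming each incoming request has already been classified. The `Kind` enum, the `should_accept` function and the `has_odi_token` flag are all hypothetical names for illustration and are not taken from the Sprout code.

```python
from enum import Enum


class Kind(Enum):
    """Hypothetical classification of an incoming request."""
    HEALTH_CHECK = "health_check"        # poll from the monitoring system
    NEW_TRANSACTION = "new_transaction"  # request that starts a new transaction
    IN_TRANSACTION = "in_transaction"    # message within an existing transaction


def should_accept(kind, quiescing, has_odi_token=False):
    """Decide whether this node should handle a request."""
    if not quiescing:
        return True
    if kind is Kind.HEALTH_CHECK:
        # Keep answering the monitor so the node is not judged dead
        # and forcibly terminated while it is still draining.
        return True
    if kind is Kind.IN_TRANSACTION:
        # Existing transactions are processed to completion.
        return True
    # New transactions are refused unless they carry an ODI token,
    # which only the local sprout can handle.
    return has_odi_token
```

Keeping the decision in a single pure function makes the policy easy to test in isolation from the network stack.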

Mark Perryman
