We have had many questions about how Clearwater handles the failure of individual node instances without disrupting the services it supports. Usually, these questions revolve around how Clearwater can process SIP requests statelessly, given that some long-lived state (such as registration state) is essential to the operation of an IMS core. The material presented here aims to answer those questions and to clarify what happens when Clearwater node instances fail, as they inevitably will from time to time.
SIP State Definitions
The SIP protocol defines two types of state: transaction state and dialog state.
A SIP transaction is initiated by a SIP request and is terminated by a final response. For example, a SIP transaction may begin with an INVITE request, and the transaction is terminated by a final response such as 200 OK or 404 Not Found (although technically in the latter case the transaction also includes the subsequent ACK). SIP transactions are therefore relatively short-lived.
A SIP dialog is initiated by certain types of SIP request, such as INVITE or SUBSCRIBE, and continues until terminated by a corresponding request such as BYE. SIP dialogs generally persist for substantial periods of time, e.g. for the duration of a voice call.
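To make the transaction boundary concrete, the following sketch (not a real SIP stack; the function name is our own) applies the rule that provisional (1xx) responses keep a transaction alive, while any response of 200 or above is final and terminates it:

```python
# Minimal sketch: deciding when a SIP transaction terminates.
# Status codes 100-199 are provisional; anything >= 200 is final.

def is_final_response(status_code: int) -> bool:
    """Return True if the status code terminates a SIP transaction."""
    return status_code >= 200

# An INVITE transaction stays alive through provisional responses...
assert not is_final_response(100)   # 100 Trying
assert not is_final_response(180)   # 180 Ringing
# ...and ends at the final response, success or failure alike.
assert is_final_response(200)       # 200 OK
assert is_final_response(404)       # 404 Not Found
```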
In general, Clearwater nodes store transaction state locally. The loss of a node instance therefore means the loss of its transaction state, and hence the failure of any transactions in progress on that instance. This is consistent with the fault tolerance techniques commonly used in call processing systems: calls are generally not protected against the failure of any component or sub-system until they have progressed to an established (i.e. connected) state.
Clearwater nodes never store dialog state, so the loss of any Clearwater node instance does not result in the loss of any dialogs.
Before getting into the details of call flows, we need to describe the way that Clearwater uses clustering and stores the necessary state. The diagram below illustrates the clustering and state storage architecture of Clearwater.
Note that we are assuming the use of an external P-CSCF and I-BCF, so the bono nodes are not present. The flows shown below illustrate the behaviour when Clearwater is configured to optimise out I-CSCF lookups of the HSS for performance reasons, and Rf billing is not configured, so ralf nodes are also not present.
The components are as follows.
- The Home Subscriber Server (HSS) is the master store for all subscriber data. It exposes a Cx/Diameter interface for retrieving and updating data.
- Homestead is Clearwater’s subscriber data cache. It exposes an HTTP interface for sprout to retrieve subscriber data. It is a cluster of 2 or more nodes and comprises
- one homestead software instance per node
- a fault-tolerant data store distributed across all the nodes, caching subscriber data retrieved from the HSS via the Cx/Diameter interface.
- Sprout is Clearwater’s SIP router. It receives requests from the P-CSCF and unsolicited requests from ASs, queries subscriber data from homestead, routes via the ISC (SIP) interface to ASs, performs ENUM queries, acts as a registrar, and routes SIP requests towards registered endpoints. It is a cluster of 2 or more nodes and comprises
- one sprout software instance per node, each of which independently maintains a store of active transactions
- a fault-tolerant data store distributed across all the nodes, storing registration data.
- The ENUM server simply responds to ENUM (DNS) queries from sprout to map telephone numbers to SIP URIs.
- The Application Server (AS) provides additional services, such as local and national dialing plans and call services. It receives requests from sprout and either handles them itself or passes them back to sprout for further processing. It may also send unsolicited requests – these go to sprout.
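For the ENUM server described above, the lookup key is built as defined by RFC 6116: strip the leading `+` from the E.164 number, reverse the digits, dot-separate them, and append a suffix (conventionally `e164.arpa`). A minimal sketch of that key construction follows; the function name is our own, and the subsequent NAPTR query that actually returns the SIP URI is not shown:

```python
def enum_domain(e164: str, suffix: str = "e164.arpa") -> str:
    """Build the ENUM (RFC 6116) lookup domain for an E.164 number:
    strip the '+', reverse the digits, separate them with dots,
    and append the suffix."""
    digits = e164.lstrip("+")
    return ".".join(reversed(digits)) + "." + suffix

# The server is then queried for NAPTR records at this domain,
# which map the number to a SIP URI.
assert enum_domain("+15551234567") == "7.6.5.4.3.2.1.5.5.5.1.e164.arpa"
```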
Communication and State
Sprout and homestead both consist of clusters of nodes. Each node has an IP address (and possibly a domain name). Additionally, each cluster has a cluster domain name, which resolves to all the IP addresses in the cluster.
The homestead cluster domain name is used only within Clearwater itself: sprout nodes address the homestead cluster via this domain name.
The sprout cluster domain name is used by the P-CSCF and AS. The P-CSCF and AS must support RFC 3263 (Locating SIP Servers), which defines how DNS is used to select among multiple candidate servers and to fail over between them. If a transaction that the P-CSCF or AS sends towards the sprout cluster fails because the chosen sprout node does not respond, the P-CSCF or AS must retry the transaction against a different sprout node.
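The retry behaviour described above can be sketched as follows. This is a toy illustration rather than P-CSCF code: `send_with_failover` and its arguments are hypothetical, and a shuffled list of addresses stands in for DNS round-robin resolution of the cluster domain name.

```python
import random

def send_with_failover(cluster_ips, send_fn):
    """Try each node in the cluster (in random order, emulating DNS
    round-robin) until one accepts the transaction; raise if all fail."""
    for ip in random.sample(cluster_ips, len(cluster_ips)):
        try:
            return send_fn(ip)
        except ConnectionError:
            continue  # node did not respond: retry against the next one
    raise ConnectionError("no sprout node responded")

# Usage: one node is down, so the transaction succeeds via another.
ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
def flaky_send(ip):
    if ip == "10.0.0.1":
        raise ConnectionError("node down")
    return f"200 OK from {ip}"

assert send_with_failover(ips, flaky_send).startswith("200 OK")
```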
Sprout nodes are transaction-stateful but not dialog-stateful. If a request flows through a specific sprout, the corresponding response(s) must also flow through that sprout. Each sprout node includes its own IP address in Via headers to ensure that responses are routed back via the same sprout node as the request.
The sprout cluster stays in the dialog’s signaling path by record-routing itself using the cluster domain name. The cluster domain name resolves to all of the sprout IP addresses, so there is no need for a subsequent in-dialog transaction to be processed by the same sprout instance as the original dialog-initiating transaction.
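The split between per-transaction and per-dialog routing can be illustrated with a hypothetical helper (not real Clearwater code) that builds the two headers a sprout node would add when proxying a dialog-initiating request:

```python
def proxy_headers(node_ip: str, cluster_domain: str) -> dict:
    """Sketch of the headers a sprout node adds when proxying a
    dialog-initiating request (hypothetical helper).

    - Via carries the node's own IP, so responses to *this* transaction
      return through the same instance, which holds the transaction state.
    - Record-Route carries the cluster domain name, so later in-dialog
      requests can be handled by *any* instance in the cluster.
    """
    return {
        "Via": f"SIP/2.0/TCP {node_ip};branch=z9hG4bK-example",
        "Record-Route": f"<sip:{cluster_domain};transport=tcp;lr>",
    }

hdrs = proxy_headers("10.0.0.1", "sprout.example.com")
assert "10.0.0.1" in hdrs["Via"]
assert "sprout.example.com" in hdrs["Record-Route"]
```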
Homestead processes HTTP requests, not SIP requests. It is technically HTTP-transaction-stateful, but since it always responds to HTTP requests itself, these transactions are very short-lived (generally low tens of milliseconds).
Before diving into the detail of what happens on node failure, consider the normal flows when
- processing dialogs, including initiation, in-dialog requests and termination
- processing out-of-dialog requests such as MESSAGE.
Dialog Initiation (on-net)
Behavior on Node Failure
If a sprout node fails before receiving a request from the P-CSCF, the request fails and the P-CSCF retries.
If a sprout node fails after a dialog is established, again the P-CSCF or UE retries: the Route header specifies the sprout cluster domain name, so the P-CSCF resolves it via DNS and retries the request against a different sprout node.
If a sprout node fails while a transaction is in progress, the transaction fails. Either the UE will retry automatically or it will display an error to the user, who should retry.
This transaction failure scenario occurs even if the only operation the sprout node has performed is sending the 100 Trying response to an INVITE, as from that point on the P-CSCF will not retry the transaction.
All of the above apply equally when the request is an unsolicited request sent by an AS, rather than a request from a P-CSCF.
Sprout uses a distributed, fault-tolerant registration store. As a result, if a user registers via one sprout node and that sprout node then fails, subsequent messages for that user can still be routed by any other sprout node.
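A toy model of this property, with a plain dict standing in for the distributed registration store (all class and variable names are our own):

```python
class SproutNode:
    """Toy model: every node shares one registration store (standing in
    for Clearwater's distributed, fault-tolerant store), but keeps its
    own transaction state locally."""
    def __init__(self, shared_store):
        self.store = shared_store    # survives any single node failure
        self.transactions = {}       # lost if this node fails

    def register(self, aor, contact):
        self.store[aor] = contact

    def lookup(self, aor):
        return self.store.get(aor)

store = {}                           # stand-in for the replicated store
node_a, node_b = SproutNode(store), SproutNode(store)
node_a.register("sip:alice@example.com", "sip:alice@10.1.1.1:5060")
del node_a                           # node A fails...
# ...but node B can still route to Alice via the shared store.
assert node_b.lookup("sip:alice@example.com") == "sip:alice@10.1.1.1:5060"
```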
Note that sprout also has interfaces to the ENUM server and homestead. In both cases, sprout issues a request and waits for the response before continuing to process the SIP request. A sprout node failure at this point is therefore equivalent to the node failing while processing the SIP request, and results in transaction failure back to the UE.
Homestead’s HTTP interface is simple. If one homestead instance does not respond to sprout, sprout tries a different one.
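A sketch of that retry pattern from sprout's side (the function and node names are hypothetical, and `http_get` stands in for whatever HTTP client is used):

```python
def fetch_subscriber_data(homestead_nodes, http_get):
    """Issue the HTTP request to one homestead node; on failure or
    timeout, simply retry against another node in the cluster."""
    last_error = None
    for node in homestead_nodes:
        try:
            return http_get(node)
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc         # node failed: try the next one
    raise last_error or ConnectionError("no homestead node responded")

# Usage: the first node times out, so the data comes from the second.
def fake_get(node):
    if node == "homestead-1":
        raise TimeoutError("no response")
    return {"node": node, "subscriber": "alice"}

assert fetch_subscriber_data(["homestead-1", "homestead-2"], fake_get)["node"] == "homestead-2"
```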