Recovering from Network Outages

This blog post describes what happens when transport-level connectivity issues occur between a UCMA application that hosts at least one application endpoint and the associated Lync Front-End (FE) Server. It also describes how to recover gracefully from such network outages.

There are two mechanisms to minimize the consequences of network outages:

  • Application pool: you can set up an application pool to run multiple instances of the same application on multiple hosts. If there is a network connectivity issue between one of the application instances and the FE pool, Lync Server automatically delivers incoming calls to the other instances.
  • Endpoint recovery: the UCMA stack implements an endpoint recovery mechanism. With this mechanism, the stack can automatically reconnect an endpoint to the FE under the hood if the network issue is temporary.
     

This blog post analyzes the recovery mechanism implemented by the UCMA stack for application endpoints. It describes how the mechanism works, the circumstances in which it helps, and what applications should do when it does not help.

Of course, the internal UCMA implementation details are not known to me. All of the information below is extracted from log files captured using the OCSLogger tool on a UCMA 3.0 application server that registers a single application endpoint (ApplicationEndpointSettings.UseRegistration = true) to a Lync Standard Edition Server. The OCSLogger tool can be configured to collect the log entries related to the UCMA layer, which show when the stack tries to connect to the FE or reregister an application endpoint.

Application endpoint states

UCMA application endpoints can be in the following states: Idle, Establishing, Established, Reestablishing, Draining, Terminating and Terminated. Endpoints offer a StateChanged event to notify subscribers each time the endpoint state changes.

The entire state machine (a finite state machine, FSM) is well documented here. This blog post deals only with the states and state transitions that are directly related to network outages. These state transitions are indicated in the picture below.

app_ep_fsm.PNG

The FSM state transitions that are directly related to network outages are the following:

  • Established => Reestablishing => Established
  • Established => Reestablishing => Terminating
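To make the outage-related part of the FSM easier to reason about, here is a minimal Python sketch of it. It is only a model for logging or testing purposes and not UCMA code: the names EndpointState, OUTAGE_TRANSITIONS and transition_number are my own, and the transition numbers are the ones used in the section headings of this post.

```python
from enum import Enum
from typing import Optional


class EndpointState(Enum):
    """Application endpoint states listed in this post."""
    IDLE = "Idle"
    ESTABLISHING = "Establishing"
    ESTABLISHED = "Established"
    REESTABLISHING = "Reestablishing"
    DRAINING = "Draining"
    TERMINATING = "Terminating"
    TERMINATED = "Terminated"


# Only the outage-related transitions discussed in this post.
# The full FSM contains more transitions than these three.
OUTAGE_TRANSITIONS = {
    (EndpointState.ESTABLISHED, EndpointState.REESTABLISHING): 4,
    (EndpointState.REESTABLISHING, EndpointState.ESTABLISHED): 5,
    (EndpointState.REESTABLISHING, EndpointState.TERMINATING): 9,
}


def transition_number(previous: EndpointState,
                      current: EndpointState) -> Optional[int]:
    """Return the transition number used in this post, or None if the
    state change is not one of the outage-related transitions."""
    return OUTAGE_TRANSITIONS.get((previous, current))
```

A state change handler could call transition_number(previous, current) and, for example, raise an alert only when the result is 9.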

State transition 4: Established => Reestablishing

According to the UCMA stack traces, the stack always detects a network outage within 200 seconds. The stack sends TCP packets to the FE periodically, so it detects the outage even if the application does not perform any UCMA-related activity. When it detects the outage, the application endpoint state changes from Established to Reestablishing.

State transition 5: Reestablishing => Established

In the Reestablishing state, the UCMA stack periodically tries to reconnect to the FE, using gradually increasing time intervals. It then tries to reregister the endpoint as many times as specified by ApplicationEndpointSettings.MaxRegisterRetries (the default is 1). If the network connectivity issue is resolved within 10 minutes, the stack detects this within 2 minutes and reregisters the endpoint automatically. When this is done, the application endpoint state changes from Reestablishing to Established.

State transition 9: Reestablishing => Terminating

If the stack does not manage to reconnect and reregister the application endpoint, the endpoint state changes from Reestablishing to Terminating after 10 + 2 * MaxRegisterRetries minutes, and then moves to the Terminated state immediately. Once the endpoint has reached the Terminated state, the UCMA stack never reestablishes it again, so the application logic needs to take care of further endpoint recovery. For example, it can try to reestablish the endpoint by invoking the ApplicationEndpoint.BeginEstablish() method periodically.
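The timeout formula and the periodic retry described above can be sketched as follows. This is a language-neutral illustration in Python, not UCMA code: try_establish is a hypothetical callable that stands in for whatever the application does to reestablish its endpoint and returns True on success, and the function names and the injectable sleep parameter are my own assumptions.

```python
import time
from typing import Callable, Optional


def termination_deadline_minutes(max_register_retries: int = 1) -> int:
    """Minutes spent in Reestablishing before the endpoint is terminated,
    using the 10 + 2 * MaxRegisterRetries figure observed in the traces."""
    return 10 + 2 * max_register_retries


def retry_until_established(try_establish: Callable[[], bool],
                            interval_seconds: float,
                            sleep: Callable[[float], None] = time.sleep,
                            max_attempts: Optional[int] = None) -> int:
    """Call try_establish() periodically until it succeeds.

    try_establish is a hypothetical application callback, not a UCMA API.
    Returns the number of attempts that were needed."""
    attempts = 0
    while True:
        attempts += 1
        if try_establish():
            return attempts
        if max_attempts is not None and attempts >= max_attempts:
            raise RuntimeError(
                f"endpoint not established after {attempts} attempts")
        # Wait before the next attempt; no wait after the final failure.
        sleep(interval_seconds)
```

With the default MaxRegisterRetries of 1, termination_deadline_minutes() gives 12 minutes. Passing sleep as a parameter keeps the retry loop testable without real waiting.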

Consequences

All this means that a temporary network outage that lasts less than 2 minutes might be completely invisible to the UCMA application if the application does not perform any UCMA-related activity. In this case, the UCMA application sees the endpoint in the Established state the whole time. However, callers will not be able to connect to the application endpoint for Tno minutes, where Tno denotes the exact duration of the network outage.

A network outage that lasts more than 2 minutes but less than 10 + 2 * MaxRegisterRetries minutes first causes the Established => Reestablishing state transition. The application endpoint state then changes back to Established. The UCMA traces indicate that in this case the application endpoint state always returns to Established within Tno + 2 minutes, so callers will not be able to connect to the application endpoint for Tno + 2 minutes.

Network outages that last more than 10 + 2 * MaxRegisterRetries minutes cause the Established => Reestablishing => Terminating => Terminated state transitions. The UCMA application needs to implement the subsequent recovery procedure itself. If the UCMA application tries to reestablish the endpoint every Tar minutes, callers will not be able to connect to the application endpoint for Tno + Tar minutes in this case.
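The three cases can be summarized in a small function. This is a sketch based only on the figures quoted in this post. The function name is my own, and how the exact boundary values (an outage of exactly 2 minutes, or of exactly 10 + 2 * MaxRegisterRetries minutes) are classified is an assumption, because the traces do not settle that.

```python
def caller_downtime_minutes(tno: float,
                            max_register_retries: int = 1,
                            tar: float = 1.0) -> float:
    """Estimate how long callers cannot reach the application endpoint.

    tno: duration of the network outage in minutes.
    tar: interval in minutes at which the application retries to
         reestablish a terminated endpoint.
    """
    termination_deadline = 10 + 2 * max_register_retries
    if tno < 2:
        # Outage is invisible to the application; callers are still
        # cut off for the duration of the outage.
        return tno
    if tno < termination_deadline:
        # Established => Reestablishing => Established.
        return tno + 2
    # Established => Reestablishing => Terminating => Terminated;
    # recovery depends on the application's own retry interval.
    return tno + tar
```

For example, with the default MaxRegisterRetries of 1, a 5-minute outage gives 7 minutes of downtime, while a 15-minute outage with Tar = 3 gives 18 minutes.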

As I mentioned above, you can reduce the consequences of network outages by organizing multiple application instances into pools. This helps if a network issue occurs between one of the application instances and the FE pool. However, in my opinion such isolated network issues rarely happen.