“504 Server time-out” While Creating Audio Conferences on a Lync Enterprise Edition Front End Pool

This blog post describes what happens in the background when a middle-tier UCMA application schedules audio conferences in a Lync Server 2013 Enterprise Edition Pool environment. It also describes what the Lync Server Pool does in the background to provide a highly available conferencing solution, and what the UCMA application needs to do when a specific symptom (“SIP/2.0 504 Server time-out”) occurs.

Please note that the description below is based on my personal observations and investigation, and includes my personal conclusions. I have analyzed dozens of Lync Server trace files in the last few years, and the description draws on that experience, so nothing guarantees that it is error-free.

The figures below do not show all of the signaling (SIP) messages, only those that are relevant in the current context. This blog post assumes a Lync Enterprise Edition Pool environment with 3 Front End Servers and a UCMA application which uses conferencing. The UCMA application hosts a single application endpoint which creates audio conferences. After an audio conference is created, the endpoint dials into the conference and finally dials out to a Lync user from the same conference. This simplified scenario (1 endpoint; 2 participants per conference) is sufficient to illustrate everything this blog post covers.

Registering an application endpoint

First of all, let us see what happens in the background when the UCMA application registers an application endpoint by invoking the ApplicationEndpoint.BeginEstablish() method.

  1. the UCMA layer in the application resolves the Lync Front End Pool FQDN from DNS. Each time the Pool FQDN is resolved, the DNS gives back the IP addresses of the Front End Servers in round-robin fashion

  2. then the UCMA application sends a SIP REGISTER request to one of the Front End Servers

  3. if that Front End Server is not the so-called “Home Front End Server” (or simply “Home Server”) of the application endpoint, then the UCMA application receives a SIP 301 redirect response. The Contact header in the SIP 301 response message includes the SIP URI of the Home Server

  4. then the UCMA application turns to the Home Server and sends a REGISTER request again

  5. and if everything works fine then the UCMA application receives a 200 OK from the Home Server indicating that the application endpoint is registered successfully.

So, regardless of which particular Front End Server the UCMA application contacts first, the application will be redirected to the Home Server.
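The registration steps above can be sketched as a small simulation. This is a toy model in Python, not UCMA code: the `ToyPool` class, the `register` function and the server names `fe1` to `fe3` are all hypothetical and exist only to illustrate round-robin DNS followed by a 301 redirect to the Home Server.

```python
from itertools import cycle


class ToyPool:
    """Toy model of a Front End Pool (hypothetical, not a Lync or UCMA API)."""

    def __init__(self, servers, home_server):
        self.servers = list(servers)
        self.home_server = home_server
        self._dns = cycle(self.servers)  # round-robin DNS answers

    def resolve(self):
        """Return the next Front End Server, as round-robin DNS would."""
        return next(self._dns)

    def handle_register(self, server):
        """Return (status, contact) for a REGISTER sent to `server`."""
        if server != self.home_server:
            # Redirect: the Contact value names the Home Server.
            return 301, self.home_server
        return 200, None


def register(pool):
    """Register, following one 301 redirect; return the servers contacted."""
    target = pool.resolve()
    contacted = [target]
    status, contact = pool.handle_register(target)
    if status == 301:
        contacted.append(contact)
        status, _ = pool.handle_register(contact)
    if status != 200:
        raise RuntimeError("registration failed")
    return contacted


pool = ToyPool(["fe1", "fe2", "fe3"], home_server="fe3")
for _ in range(3):
    print(register(pool))
```

Whichever server the DNS answer points to, the last entry of the returned list is always the Home Server, which is the point of the redirect.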

Scheduling the 1st audio conference

Now that the application endpoint is registered, let it create an audio conference and see what happens in the background.

1) When the UCMA application invokes the ConferenceServices.BeginScheduleConference() method, a C3P (Centralized Conference Control Protocol) AddConference request is sent to the conference Focus Factory in order to create the conference. The Focus Factory URI looks like this: sip:endpoint@contoso.com;gruu;opaque=app:conf:focusfactory. According to my observation, the application endpoint always turns to the Focus Factory which is located on its Home Server. So the AddConference request always goes to the Home Server.

[Figure: figure_2.png]

When the conference is created successfully, the UCMA application gets back the “conference focus URI”, e.g. sip:endpoint@contoso.com;gruu;opaque=app:conf:focus:id:45WNQFYZ. The application will use this Focus URI to issue conference control commands (e.g. add participants, terminate conference). The Focus itself also “resides” on the Home Server.
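To make the shape of these URIs easier to see, here is a small Python sketch that splits a conference GRUU into its parts. The function name `parse_conference_gruu` is hypothetical, and the layout of the `opaque` parameter (`app:conf:<entity>[:id:<conference id>]`) is inferred only from the example URIs in this post, not from any specification.

```python
def parse_conference_gruu(uri):
    """Split a conference GRUU into user, domain, entity and conference id.

    Assumption: the opaque parameter looks like
    app:conf:<entity>[:id:<conference id>], as in the examples in this post.
    """
    address, *params = uri.split(";")
    _, _, user_host = address.partition(":")      # drop the "sip" scheme
    user, _, domain = user_host.partition("@")
    opaque = ""
    for param in params:
        name, _, value = param.partition("=")
        if name == "opaque":
            opaque = value
    parts = opaque.split(":")   # e.g. ["app", "conf", "focus", "id", "45WNQFYZ"]
    entity = parts[2] if len(parts) > 2 else None
    conf_id = parts[4] if len(parts) > 4 and parts[3] == "id" else None
    return {"user": user, "domain": domain,
            "entity": entity, "conference_id": conf_id}


factory = parse_conference_gruu(
    "sip:endpoint@contoso.com;gruu;opaque=app:conf:focusfactory")
focus = parse_conference_gruu(
    "sip:endpoint@contoso.com;gruu;opaque=app:conf:focus:id:45WNQFYZ")
print(factory["entity"], factory["conference_id"])
print(focus["entity"], focus["conference_id"])
```

The Focus Factory URI carries no conference id because it is per endpoint, while each Focus URI identifies one concrete conference.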

2) After the conference is created, the UCMA application can join participants to the conference by invoking the ConferenceSession.BeginJoin() method. This results in sending C3P AddUser requests to the Focus (conference focus URI). As I mentioned above, the application endpoint will join the conference as the first participant. It will use the dial-in method to establish a media session to the conference. The AddUser request and subsequent C3P requests always go to the Home Server since the Focus is located there, but the requests might traverse different paths until they arrive at the Home Server. They might go through different Front End Servers. The Front End Servers in the pool act as SIP proxies. They forward the requests toward the particular Front End Server which hosts the Focus for the conference.

When a participant has joined the conference successfully, it appears in the conference roster. Moreover, the UCMA application gets back:

  • “MCU conference URI” (e.g. sip:endpoint@contoso.com;gruu;opaque=app:conf:audio-video:id:45WNQFYZ) which will be used to establish media sessions to the conference.
  • “MCU server URI” (e.g. sip:pool01.contoso.com:5063;transport=tls;ms-fe=fe3.contoso.com) which will not be used by the UCMA application directly but it shows the Front End Server whose AV MCU is allocated to mix media streams in the conference.

The Focus and the MCU might be located on different Front End Servers in the pool.
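The “MCU server URI” is the piece that reveals which Front End Server hosts the allocated MCU. As a rough illustration, the Python sketch below pulls the host, port and `ms-fe` parameter out of the example URI. The function name `parse_mcu_server_uri` is hypothetical, and the parsing is deliberately naive (no escaping, no IPv6 literals).

```python
def parse_mcu_server_uri(uri):
    """Naively split an MCU server URI into host, port and parameters."""
    address, *params = uri.split(";")
    scheme, _, hostport = address.partition(":")
    host, _, port = hostport.partition(":")
    # Keep only name=value parameters; flag-style parameters are ignored.
    param_map = dict(p.split("=", 1) for p in params if "=" in p)
    return {
        "scheme": scheme,
        "host": host,
        "port": int(port) if port else None,
        "params": param_map,
    }


mcu = parse_mcu_server_uri(
    "sip:pool01.contoso.com:5063;transport=tls;ms-fe=fe3.contoso.com")
print(mcu["host"], mcu["port"])   # pool FQDN and MCU port
print(mcu["params"]["ms-fe"])     # Front End Server hosting the MCU
```

Comparing the `ms-fe` value with the Home Server name is a quick way to see from a trace whether the Focus and the MCU sit on the same Front End Server.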

3) Now that the application endpoint has joined the conference successfully as the first participant, it needs to establish a media session to the conference. As I mentioned above, it will use the dial-in method. The dial-in method starts with an INVITE request sent to the MCU conference URI. The INVITE request which is sent to this URI actually also goes to the Focus (Home Server) first. Then the Focus talks to the allocated MCU behind the scenes. Thus all the conference related signaling seems to go through the Focus. This is just the signaling. The media session is established between the participant and the MCU. This is because the SDP in the 200 OK - which is received as the response to the INVITE request - includes the MCU’s candidates (IP address + port).
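To show where the MCU’s address and port come from, the following Python sketch extracts candidates from an SDP body. The SDP fragment is made up for illustration and uses the generic ICE-style `a=candidate` line layout with documentation-range IP addresses; real Lync SDP carries many more attributes, and `parse_candidates` is a hypothetical helper.

```python
def parse_candidates(sdp):
    """Return (address, port, type) for each a=candidate line in an SDP body."""
    candidates = []
    for line in sdp.splitlines():
        line = line.strip()
        if not line.startswith("a=candidate:"):
            continue
        fields = line[len("a=candidate:"):].split()
        # Expected fields: foundation, component, transport, priority,
        # address, port, the literal "typ", candidate type.
        if len(fields) < 8 or fields[6] != "typ":
            continue
        candidates.append((fields[4], int(fields[5]), fields[7]))
    return candidates


# Hypothetical SDP answer; 192.0.2.10 stands in for the MCU's address.
SAMPLE_SDP = """v=0
o=- 0 0 IN IP4 192.0.2.10
s=session
c=IN IP4 192.0.2.10
t=0 0
m=audio 50000 RTP/AVP 0
a=candidate:1 1 UDP 2130706431 192.0.2.10 50000 typ host
a=candidate:1 2 UDP 2130706430 192.0.2.10 50001 typ host
"""

print(parse_candidates(SAMPLE_SDP))
```

The signaling reaches the Focus, but the addresses in these lines belong to the MCU, which is why the media flows to the MCU directly.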

4) Now the application endpoint has joined the conference successfully, it appears on the conference roster, and its media session is established toward the proper MCU. It is time for the UCMA application to add another participant to the conference. As I mentioned above, the 2nd participant will be a Lync user and the application endpoint will use the dial-out method to connect the participant to the MCU. The figure shows the new C3P AddUser request sent by the application endpoint to the Focus. As you can see, the same signaling path is used to add the 2nd participant. This is because the “Record-Route” SIP header was used under the hood at the time the 1st participant was added and the “Route” SIP header is used to force subsequent AddUser requests to traverse the same path.
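The Record-Route / Route mechanism mentioned above can be sketched in Python. The model follows the standard SIP rule that each proxy puts its own entry at the top of Record-Route and that the caller builds its route set from the response’s Record-Route values in reverse order. The server names and both function names are hypothetical, and the sketch ignores everything else in a SIP message.

```python
def record_route_after(proxies):
    """Record-Route header values after a request traversed `proxies` in order.

    Each proxy inserts its own URI at the top of the header list.
    """
    record_route = []
    for proxy in proxies:
        record_route.insert(0, f"<sip:{proxy};lr>")
    return record_route


def caller_route_set(record_route):
    """Caller side: the route set is the Record-Route list in reverse order."""
    return list(reversed(record_route))


# First AddUser request: the endpoint happens to reach fe1, which proxies
# the request to the Home Server fe3 where the Focus lives.
rr = record_route_after(["fe1.contoso.com", "fe3.contoso.com"])
routes = caller_route_set(rr)
print(rr)      # as echoed in the response
print(routes)  # Route headers for later requests in the same dialog
```

Because later requests in the dialog carry these Route values, they visit fe1 and then fe3 again, which matches the observation that the 2nd AddUser request follows the same path as the first.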

5) After the Lync user is added to the Focus successfully, the next thing to do is to dial out the Lync user from the conference in order to establish media session between him and the MCU. As the first step of this dial-out procedure the Lync user gets a SIP INVITE from the MCU conference URI (the INVITE is initiated by the MCU and goes through the Focus). As soon as the Lync user accepts the call, the associated media session will be established toward the proper MCU since SDP in the INVITE carries the MCU’s candidates (IP address + port).

6) Now let us summarize the outcome of the previous 5 steps. As the following figure shows:

  • We have an ongoing audio conference with 2 participants: the application endpoint and a Lync user
  • SIP dialogs belonging to the different participants might use different paths but all of them go to the Focus (Home Server)
  • Media sessions belonging to the different participants go to the same MCU, namely the MCU which was allocated by the pool to mix media streams in the conference
  • The conference Focus and the MCU might be located on different Front End Servers

Scheduling the next audio conference

By now the application endpoint has scheduled its 1st conference and the participants are connected successfully to that conference. Let us see what happens when the same application endpoint creates the next audio conference.

As already mentioned above, the UCMA application endpoint turns to the same server (Home Server) each time it schedules a conference. This is because the conference Focus Factory and the conference Focus are always located on the Home Server. However, the MCU which is allocated by the Lync pool to mix the media streams for the participants changes from conference to conference. This means that the MCU related load is distributed by the Lync pool between the available Front End Servers. The following figure shows what the signaling and media paths might look like when the 2nd audio conference is scheduled and participants are connected. As you can see, the MCU used by the conference is now located on another Front End Server (“Front End B”).
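The contrast between the fixed Focus and the changing MCU can be shown with a tiny Python model. The round-robin choice of MCU is purely an assumption made for the sake of the example; the traces only show that the MCU changes between conferences, not how the pool picks it. The function and server names are hypothetical.

```python
from itertools import cycle


def schedule_conferences(front_ends, home_server, count):
    """Return where the Focus and MCU land for `count` conferences.

    Assumption: MCUs are handed out round-robin. The real allocation
    policy of the pool is not known from the observations in this post.
    """
    mcu_picker = cycle(front_ends)
    return [
        {"focus": home_server, "mcu": next(mcu_picker)}
        for _ in range(count)
    ]


for conf in schedule_conferences(["feA", "feB", "feC"], "feA", 4):
    print(conf)
```

Every conference reports the same Focus location, while the MCU moves across the pool, so only the media mixing load is spread out.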

Losing the Home Server

That’s nice. We have seen what happens when our application endpoint schedules multiple audio conferences. We have seen that the Lync Enterprise Edition Pool distributes MCU related load between the available Front End Servers.

We have also seen that the Home Server of the application endpoint plays a key role. The conference Focus is located there and all the conference related SIP signaling messages go there. So, a question which naturally arises is the following: what happens if we lose the Home Server permanently?

The answer is quite simple: assuming that the Lync Front End Pool is still operational (meaning the pool still has the minimum number of Front End Servers required), another Front End Server will be designated as the Home Server.

When the original Home Server is lost, the UCMA application receives an OperationTimeoutException (“SIP/2.0 504 Server time-out”) each time it tries to schedule a conference. The endpoint state changes automatically from Established => Reestablishing (Reason: RegistrationRefreshFailed) => Established (Reason: RegistrationRefreshSucceeded). However, the endpoint still cannot create new conferences; each attempt results in “SIP/2.0 504 Server time-out”. Tracing the Lync Pool shows why: the pool still tries to use the lost Home Server to create the conference.

In order to recover the application endpoint, the UCMA application needs to unregister the endpoint and register it again. Without this re-registration, the application endpoint will not recover; it will not be able to schedule conferences. If the registration succeeds, everything starts working again as described above. When the UCMA application re-registers the application endpoint, the registration request is redirected by the pool to the new Home Server, as shown in the following figure. This new Home Server is then used to create conferences.

The endpoint can schedule audio conferences again and participants can connect to those conferences. The conference Focus will be hosted on the new Home Server and MCU related load will be distributed by the Lync Pool between the Front End Servers which are still available.
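The recovery procedure described in this section can be sketched as follows. This is a toy Python model, not UCMA code: `FailoverPool`, `ToyEndpoint`, `ServerTimeout` and `schedule_with_recovery` are hypothetical names, and the way the new Home Server is chosen is an arbitrary assumption. For simplicity the model stores the stale Home Server on the endpoint object, whereas the traces suggest it is the pool that keeps routing to the lost server.

```python
class ServerTimeout(Exception):
    """Stands in for the 504 Server time-out failure seen by the application."""


class FailoverPool:
    """Toy pool that designates a new Home Server when the old one is lost."""

    def __init__(self, servers, home):
        self.alive = set(servers)
        self.home = home

    def fail(self, server):
        self.alive.discard(server)
        if server == self.home:
            # Assumption: any surviving server may become the Home Server.
            self.home = sorted(self.alive)[0]


class ToyEndpoint:
    def __init__(self, pool):
        self.pool = pool
        self.bound_home = None

    def register(self):
        # The redirect lands on whatever the pool currently designates.
        self.bound_home = self.pool.home

    def unregister(self):
        self.bound_home = None

    def schedule_conference(self):
        if self.bound_home not in self.pool.alive:
            raise ServerTimeout("SIP/2.0 504 Server time-out")
        return f"focus on {self.bound_home}"


def schedule_with_recovery(endpoint):
    """Schedule a conference; on a 504, re-register once and retry."""
    try:
        return endpoint.schedule_conference()
    except ServerTimeout:
        endpoint.unregister()
        endpoint.register()
        return endpoint.schedule_conference()


pool = FailoverPool(["fe1", "fe2", "fe3"], home="fe1")
endpoint = ToyEndpoint(pool)
endpoint.register()
print(schedule_with_recovery(endpoint))  # Home Server still alive
pool.fail("fe1")
print(schedule_with_recovery(endpoint))  # 504, re-register, retry
```

A real application would probably want to limit how often it re-registers and to tell a lost Home Server apart from other causes of a 504, which this sketch does not attempt.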

Summary

So, if a UCMA application endpoint uses audio conferencing, the following happens on the Lync Server Enterprise Edition Pool:

  1. The Focus Factory located on the Home Server is used to create conferences. The Focus belonging to the conference is also located on the same server.
  2. All SIP messages related to the conference go to the Focus. The Focus talks to the MCU behind the scenes
  3. Front End Servers in the pool act as SIP proxies to forward messages to the Focus
  4. The Lync pool allocates the MCU dynamically; thus it distributes MCU related load between the Front End Servers available in the pool
  5. A new Home Server is designated automatically by the pool when the original Home Server is lost
  6. However, the application endpoint needs to be re-registered if its Home Server is lost. It will not recover otherwise. Until this re-registration is done, no conferences can be scheduled by the endpoint.

4) and 5) together provide a highly available, fault tolerant conferencing service in the Enterprise Edition Pool. Unfortunately, as 6) describes, the UCMA application needs to re-register the application endpoint if the Home Server is lost. This is a drawback. It would be great if this recovery procedure were hidden from the application.

I hope this description includes valuable information to understand what happens under the hood if UCMA based audio conferencing is used.