Management Network Leads to a 5-Second Delay in PSTN Dialout

Assumption: Your Lync Server infrastructure is virtualized, and each Front End Server and Mediation Server has two network interface cards. One belongs to the production (data, voice) network and the other belongs to the virtual machine management network.

You have a middle-tier UCMA application in this environment which schedules an audio/video conference and dials out a PSTN number from the conference using AudioVideoMcuSession.BeginDialOut(telUri, callbackFunction, …).

Symptom: Your UCMA application experiences a 5-second delay. More precisely, the following happens:

  1. The PSTN number is dialed out successfully from the conference.
  2. As soon as the alerting call is answered, the dialed-out PSTN party is connected to the conference.
  3. If there are other participants in the conference, the dialed-out PSTN party can hear them and talk to them.
  4. However, the callbackFunction passed to BeginDialOut() is not invoked in time. It is invoked 5 seconds later than expected.

So, the PSTN party is already in the conference but the UCMA application does not know about that. The application is notified 5 seconds later.

Normal circumstances

Before looking at the root cause of this 5-second delay, let us see what happens in normal circumstances, when there is no management network. Assume the following simplified Lync Server infrastructure, which consists of a Lync Front End Server, a Mediation Server, an IP gateway, and a UCMA Application Server on the production network 10.168.x.x.

As already mentioned above, the UCMA application running on the application server registers an application endpoint, schedules a conference and then dials out a PSTN number from the conference. 

As the next figure indicates, the following happens on the application level (green) and the SIP signaling level (red): 

1. The UCMA application invokes AudioVideoMcuSession.BeginDialOut(telUri, callbackFunction, …) to dial out a PSTN number from the conference. As the method signature indicates, the application provides a callback function to be called back with the outcome when the dialout process succeeds/fails.

2. In the background, the UCMA layer sends a SIP INFO message to the AV MCU. This SIP INFO conveys the tel URI to dial out and other information the MCU needs.

3. The MCU dials out the PSTN number through the Mediation Server, which means setting up a SIP dialog between the AV MCU and the Mediation Server. When the PSTN call is answered, the MCU receives the 200 OK and sends an ACK, which establishes the SIP dialog. Both the SIP INVITE and the 200 OK include SDP starting with something like this:

a=candidate:1 1 UDP 2130705919 10.168.x.x 51212 typ host
a=candidate:1 2 UDP 2130705406 10.168.x.x 51213 typ host
…. other reflexive and relay candidates

So, the highest priority candidate in the SDP is the host candidate belonging to the production network interface 10.168.x.x. All the other candidates in the SDP (reflexive, relay) have lower priority. 
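To check a captured SDP for this, the candidate lines can be parsed and compared by priority. The following is a minimal Python sketch, not part of Lync Server or UCMA. The names `Candidate`, `parse_candidates` and `highest_priority` are hypothetical, and the input is just the two host candidate lines shown above, with the masked addresses kept as they are.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    foundation: str
    component: int
    transport: str
    priority: int
    address: str
    port: int
    cand_type: str


def parse_candidates(sdp: str) -> List[Candidate]:
    """Extract a=candidate lines that have the layout shown in this article."""
    result = []
    for line in sdp.splitlines():
        line = line.strip()
        if not line.startswith("a=candidate:"):
            continue
        fields = line[len("a=candidate:"):].split()
        # Layout: foundation component transport priority address port "typ" type
        if len(fields) < 8 or fields[6] != "typ":
            continue
        result.append(Candidate(fields[0], int(fields[1]), fields[2],
                                int(fields[3]), fields[4], int(fields[5]),
                                fields[7]))
    return result


def highest_priority(candidates: List[Candidate], component: int = 1) -> Candidate:
    """Return the highest priority candidate for one component (1 = RTP, 2 = RTCP)."""
    return max((c for c in candidates if c.component == component),
               key=lambda c: c.priority)


sdp = """\
a=candidate:1 1 UDP 2130705919 10.168.x.x 51212 typ host
a=candidate:1 2 UDP 2130705406 10.168.x.x 51213 typ host
"""
best = highest_priority(parse_candidates(sdp))
print(best.address, best.priority)  # 10.168.x.x 2130705919
```

Running the same check on the SDP of both the INVITE and the 200 OK shows which interface each side put at the top of its candidate list.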

4. Both the MCU and the Mediation Server perform candidate checking/probing (STUN). The MCU then initiates a re-INVITE to exchange the final candidates. Candidate probing takes very little time: typically only a few hundred milliseconds pass between the first ACK and the re-INVITE.

5. After the re-INVITE is done, the MCU sends an INFO message to the UCMA application in order to notify the UCMA application about the outcome of the dialout request.

6. The UCMA layer invokes the callback function provided by the UCMA application. Inside this callback function, the UCMA application invokes AudioVideoMcuSession.EndDialOut(), which returns the dialout result.

All this means that in normal circumstances the UCMA application is notified about the dialout outcome within a few hundred milliseconds: the INFO reaches the application server a few hundred milliseconds after the dialed-out PSTN party answers the call.

Management network

Now let us see what happens when there is also a management network, 192.168.x.x.

As the following figure indicates, if you look at the SDP sent in the INVITE and the 200 OK in this case, you can sometimes see 192.168.x.x as the highest priority candidate:

a=candidate:1 1 UDP 2130706431 192.168.x.x 49190 typ host
a=candidate:1 2 UDP 2130705918 192.168.x.x 49191 typ host
…. other reflexive and relay candidates

It turns out that the ICE implementation used by the MCU and the Mediation Server initializes host candidates from all local network interface cards, including the management one. There is no way to exclude specific network interface cards from this process. Even if you use the "Limit service usage to selected IP addresses" option in Topology Builder, Lync Server still uses all of the network interface cards to initialize host candidates.
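Since the servers cannot be told to skip an interface, it can help to flag host candidates that fall outside the production network when reviewing a trace. The sketch below is only an illustration: the function name `unexpected_host_addresses`, the concrete addresses and the 10.168.0.0/16 prefix are hypothetical stand-ins for the masked 10.168.x.x and 192.168.x.x values in this article.

```python
import ipaddress
from typing import Iterable, List


def unexpected_host_addresses(host_addresses: Iterable[str],
                              production_cidr: str) -> List[str]:
    """Return the host candidate addresses that are not on the production network."""
    network = ipaddress.ip_network(production_cidr)
    return [a for a in host_addresses
            if ipaddress.ip_address(a) not in network]


# Hypothetical addresses standing in for 10.168.x.x and 192.168.x.x.
hosts = ["10.168.1.20", "192.168.1.20"]
print(unexpected_host_addresses(hosts, "10.168.0.0/16"))  # ['192.168.1.20']
```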

Moreover, the selection of the highest priority host candidate seems inconsistent. Sometimes the management network interface (192.168.x.x) appears as the highest priority host candidate; after restarting the Lync Server services, the production network interface (10.168.x.x) may suddenly appear in that position instead. The process seems to be random, and the binding order of the network interface cards in Windows does not matter.
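The priority values in the two SDP excerpts can be decomposed to see where they differ. This sketch assumes the values follow the standard ICE priority layout from RFC 5245 (priority = 2^24 × type preference + 2^8 × local preference + (256 − component ID)). Whether the Lync ICE implementation computes them exactly this way is an assumption, and the function name `decode_priority` is hypothetical.

```python
def decode_priority(priority: int) -> dict:
    """Split an ICE candidate priority into its three fields (RFC 5245 layout)."""
    return {
        "type_preference": priority >> 24,
        "local_preference": (priority >> 8) & 0xFFFF,
        "component_id": 256 - (priority & 0xFF),
    }


# RTP (component 1) host candidates taken from the two SDP excerpts.
print(decode_priority(2130705919))  # production: local_preference 65533
print(decode_priority(2130706431))  # management: local_preference 65535
```

Under this assumption both candidates have the host type preference 126, and only the local preference differs: it is the per-interface part of the priority that puts the management interface on top.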

The 5-second delay in PSTN dialout

The reason why the UCMA application faces the 5 seconds delay is indicated in the following figure.

It seems that the ICE implementation (candidate checking/probing) used by the MCU and the Mediation Server always gives the highest priority host candidate 5 seconds to succeed. In practice, whenever the management network interface (192.168.x.x) appears in the 200 OK as the highest priority candidate, the MCU waits 5 seconds, effectively doing nothing. Even if probing for the candidate 10.168.x.x has already succeeded, it waits 5 seconds for the candidate 192.168.x.x to succeed, which will never happen because this IPv4 address belongs to the management network. Thus there is an inherent 5-second delay between the first ACK and the re-INVITE, and the UCMA application receives its notification (INFO) 5 seconds later than expected.
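The timing described here can be expressed as a small model. It only illustrates the observed behavior and is not the actual ICE implementation. The function name `time_to_reinvite_ms` is hypothetical, the 5000 ms constant is the wait observed in this article rather than a documented setting, and the probe times are made-up examples.

```python
from typing import Optional

# Wait observed in this article; not a documented Lync Server constant.
HIGHEST_PRIORITY_WAIT_MS = 5000


def time_to_reinvite_ms(highest_priority_probe_ms: Optional[int],
                        fallback_probe_ms: int) -> int:
    """Model the gap between the first ACK and the re-INVITE.

    highest_priority_probe_ms: time until the check on the highest priority
        host candidate succeeds, or None if it never succeeds.
    fallback_probe_ms: time until the check on a lower priority candidate
        succeeds.
    """
    if (highest_priority_probe_ms is not None
            and highest_priority_probe_ms <= HIGHEST_PRIORITY_WAIT_MS):
        return highest_priority_probe_ms
    # The highest priority candidate never answered, so the full wait elapses
    # even if the lower priority candidate succeeded long before.
    return max(HIGHEST_PRIORITY_WAIT_MS, fallback_probe_ms)


print(time_to_reinvite_ms(200, 250))   # production candidate on top: 200
print(time_to_reinvite_ms(None, 200))  # management candidate on top: 5000
```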

Summary

The 5-second delay in the PSTN dialout process originates from the ICE implementation used by the Lync servers (MCU, Mediation Server), which:

  • uses all local network interface cards to set up host ICE candidates in the SDP
  • offers no way to exclude specific network interface cards from this process
  • selects the highest priority host candidate inconsistently; it might use a different network interface for the highest priority host candidate after a server restart
  • gives the highest priority host candidate 5 seconds to succeed during ICE probing