On Voice Response Scalability

The Microsoft UCMA Core API and the Microsoft Speech API provide easy-to-use programming interfaces for playing audio messages to callers. The associated development kits are widely used to implement Lync-integrated or standalone ACD and IVR applications, where playing prerecorded audio and dynamically constructed TTS messages is basic functionality.

These programming interfaces are well documented, and many technical articles and examples are available on the Web. However, you can hardly find material on scalability, such as the memory and CPU required to provide voice services to a given number of simultaneous callers through these interfaces. There are no documents stating how such applications scale when more memory and CPU resources are added to the computers they run on. Answering such questions is essential in today's world of virtualized applications and hosted environments.

I know that specifying memory and CPU requirements is not easy, since the amount of resources an application consumes can be quite application specific: it depends on the quality of the application code, on which API methods are used, and on how they are used.

This blog post presents measurements taken on an Intel Core i5 2.4 GHz (M520), 8 GB RAM, x64 computer while playing prerecorded audio using the WmaFileSource and Player classes from the Microsoft.Rtc.Collaboration.AudioVideo namespace of the UCMA Core API, and while playing dynamically constructed TTS messages using the SpeechSynthesizer and SpeechSynthesisConnector classes from the Microsoft.Speech.Synthesis namespace of the Speech API. The prerecorded audio was a WMA file (361 KB), while the TTS message was the following: “As of #{CURRENT_TIME} your credit card balance is #{RANDOM} dollars”. Both messages were played repeatedly to each caller.

The following three scenarios were observed:

  • Single wma source + Single audio player shared by calls: the wma file is opened and loaded once, a single player is attached to it, and the output of this player is played to every connected caller. This scenario is typically used to play customer-independent messages where it does not matter whether the caller hears the message from the beginning (e.g. hold music or ringback tone).
  • Single wma source + Dedicated player for each call: the wma file is opened and loaded once, but a separate player is used for each call. This scenario is used to play customer-independent messages where each caller should hear the message from the beginning (e.g. welcome message, menu options).
  • Dedicated TTS for each call: a separate speech synthesizer plays the above-mentioned TTS message to each caller. This scenario is used to play customer-specific messages where each caller should hear the message from the beginning (e.g. current balance).
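
For reference, the three patterns above boil down to roughly the following UCMA/Speech API calls. This is a simplified sketch only: error handling, asynchronous completion, and the call setup are omitted, and `flow` is assumed to be an already-established AudioVideoFlow for the caller.

```csharp
// Scenario 1 - shared source and shared player:
// prepare the source once, then attach every caller's flow to the same player.
WmaFileSource sharedSource = new WmaFileSource("message.wma");
sharedSource.EndPrepareSource(
    sharedSource.BeginPrepareSource(MediaSourceOpenMode.Buffered, null, null));
Player sharedPlayer = new Player();
sharedPlayer.SetSource(sharedSource);
sharedPlayer.AttachFlow(flow);   // repeated for each caller's flow
sharedPlayer.Start();

// Scenario 2 - shared source, dedicated player per call:
// the source is still prepared only once, but each caller gets their own
// Player, so playback starts from the beginning for every caller.
Player dedicatedPlayer = new Player();
dedicatedPlayer.SetSource(sharedSource);
dedicatedPlayer.AttachFlow(flow);
dedicatedPlayer.Start();

// Scenario 3 - dedicated TTS per call:
// a SpeechSynthesizer renders into a SpeechSynthesisConnector
// attached to the caller's flow.
SpeechSynthesisConnector connector = new SpeechSynthesisConnector();
connector.AttachFlow(flow);
SpeechSynthesizer synthesizer = new SpeechSynthesizer();
synthesizer.SetOutputToAudioStream(connector,
    new SpeechAudioFormatInfo(16000, AudioBitsPerSample.Sixteen, AudioChannel.Mono));
connector.Start();
synthesizer.Speak("As of ... your credit card balance is ... dollars");
```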

This means I have observed only scenarios where the application plays messages to callers. I did not measure API methods that process audio flowing from the caller to the application (e.g. call recording, speech recognition); these might be the subject of another blog post.

Memory usage

Let us see how much memory the application used while serving different numbers of callers (average values calculated over 5-minute intervals).

[memory.png: memory usage vs. number of concurrent calls]

Observations:

  • Memory consumption appears to increase linearly in each scenario as the number of calls increases, which is good news for scalability.
  • The two scenarios playing a prerecorded message require almost the same amount of memory, so introducing a separate player for each call does not affect the amount of memory required.
  • The scenario playing a separate TTS message to each caller requires much more memory, but given how cheap memory is nowadays, even this scenario can serve a large number of callers (e.g. 100 calls – 520 MB).
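
To put the TTS figure in perspective, here is a back-of-envelope estimate. It assumes the linear growth seen in the chart and, conservatively, attributes the full 520 MB to the 100 calls (ignoring the process's base footprint, so the real ceiling is somewhat higher); the 4 GB budget is an assumed example, not a measured value.

```csharp
// Rough per-call memory cost for the TTS scenario, from the chart above.
const double ttsMemoryMb = 520.0;   // measured at 100 concurrent calls
const int measuredCalls = 100;
const double ramBudgetMb = 4096.0;  // assumed memory budget for the process

double mbPerCall = ttsMemoryMb / measuredCalls;      // ~5.2 MB per call
double callCeiling = ramBudgetMb / mbPerCall;        // ~780 calls in 4 GB
Console.WriteLine($"{mbPerCall:F1} MB/call, ~{(int)callCeiling} calls in 4 GB");
```

In other words, memory alone is unlikely to be the limiting factor for the TTS scenario; as the next section shows, the CPU is.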

CPU utilization

Now let us see the CPU utilization (average values calculated over 5-minute intervals).

[cpu.png: CPU utilization vs. number of concurrent calls]

Observations:

  • Again, the two scenarios playing prerecorded audio require almost the same amount of resources, and in both cases CPU utilization appears to increase linearly, which is good news for scalability. Using a separate player for each call does not seem to increase CPU utilization.
  • However, playing a separate TTS message to each caller consumes a huge amount of CPU and does not scale well: with only 100 calls, an Intel Core i5 2.4 GHz CPU is almost fully loaded. Moreover, based on our observations, the quality of the TTS audio callers hear degrades dramatically once CPU utilization rises above 60%.

Summary

The numbers above show that an application server can easily play prerecorded audio to more than 100 callers at the same time. Moreover, such applications scale smoothly as more memory and CPU resources are added to the application server.

However, you should take care if your application plays separate TTS messages to callers. Measure in advance, and specify clearly how much CPU your application needs to serve a given number of callers: even a relatively small number of callers can fully utilize the CPU and bring down your entire application server. You should also take preventive action, at the application or network level, to limit the number of calls that can hit your application server at the same time. This is especially important for a Lync application exposed to a large pool of corporate Lync users, or for a publicly available application where DoS attacks can happen at any time.
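
One simple application-level guard along these lines is shown below. This is a hypothetical sketch: the 60-call cap is an assumed figure, derived from the observation that TTS quality degrades above roughly 60% CPU, which on the test machine would correspond to around 60 simultaneous TTS calls if utilization grows linearly; you should measure the right cap on your own hardware.

```csharp
// Admission control: decline TTS calls beyond a measured safe limit
// instead of letting the CPU saturate and degrade audio for everyone.
public static class TtsAdmissionControl
{
    private const int MaxTtsCalls = 60;  // assumed cap; measure on your hardware
    private static readonly SemaphoreSlim TtsSlots = new SemaphoreSlim(MaxTtsCalls);

    public static bool TryStartTtsCall()
    {
        // Non-blocking acquire: if no slot is free, decline the call,
        // e.g. by playing a cheap prerecorded "busy" prompt instead.
        return TtsSlots.Wait(0);
    }

    public static void EndTtsCall()
    {
        TtsSlots.Release();
    }
}
```

Each call path would call TryStartTtsCall before creating a SpeechSynthesizer and EndTtsCall when the call ends, keeping the server below the point where TTS quality collapses.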