As remoting gets better, we’re going to have to worry about audio synchronization more
It hasn’t been much of a problem in the past, but it will be in the future. It’s time to start benchmarking audio synchronization performance.
Most remoting solutions consist of multiple channels and codecs (compression algorithms). Compared to other data streams/channels, graphics is one of the most challenging. Most benchmarking discussions around cloud and remoting tend to focus on the graphics protocols and the various codecs they leverage and combine, e.g., Microsoft RDP, Citrix HDX/ICA, Teradici PCoIP/Cloud Access Software, Mechdyne TGX, HP RGS, and VMware Blast. All can leverage the H.264 video codec and often JPEG/BMP compression, too.
Whilst the various graphical/video codecs are pretty well known, the vast array of audio and speech codecs (including SPEEX, OPUS, and Vorbis) is less so. If you poke around in the Citrix system requirements, you’ll find that Vorbis and SPEEX are available under the hood.
Historically, most protocols have separated audio (traditionally voice) from graphics; this makes a lot of sense, particularly as different QoS (quality-of-service) levels can be attributed to different channels. For Webex/GoToMeeting/Skype, you are usually more concerned that the voice quality stays high, and you can live with a little flakiness in the speaker’s headshot video.
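To make the per-channel QoS idea concrete, here’s a minimal sketch (illustrative Python; the ports, DSCP values, and overall approach are my assumptions, not any vendor’s actual defaults) of marking an audio channel for higher network priority than a graphics channel:

```python
import socket

# Illustrative only: give the audio channel a higher-priority DSCP
# marking (EF, 46) than the graphics channel (AF21, 18), so network
# gear under congestion degrades video before voice. The DSCP value
# occupies the upper 6 bits of the IP TOS byte, hence the shift.
def open_channel(port: int, dscp: int) -> socket.socket:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)
    sock.bind(("0.0.0.0", port))
    return sock

audio_channel = open_channel(5004, 46)     # EF: expedited forwarding
graphics_channel = open_channel(5005, 18)  # AF21: lower priority
```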
A decade ago, graphics limitations, hardware, and network bandwidth meant default VDI frame rates were often 12-16 fps (frames per second); now they’re probably up to 24 fps for commodity VDI/apps, and indeed rates of 10-12 fps are often still used at the low end in protocols such as VNC for support shadowing work.
However, high-end graphical use cases enabled by datacenter GPU technologies from the likes of NVIDIA, Intel, and AMD mean that frame rates of 60 fps are becoming more common, and other use cases are being considered. And, it’s reached a stage where audio synchronization is likely to become the neglected ginger-haired step-child awkwardly present at the family dinner.
Hints that audio is becoming a problem
Synchronizing video to audio channels is very hard, and the use cases to date have meant that few have noticed that it’s not something most protocols do very well at all. In fact, there’s negligible public information on it, and little community investigation has occurred. Basically, it hasn’t really been a problem to date, so few have had any reason to prod the weak points. However, with so many of the graphics challenges now solved that would previously have stopped you ever getting to the point where audio synchronization was an issue, the signs of trouble are starting to appear.
In Citrix’s case, the audio codecs mainly used are SPEEX and Vorbis, both of which have largely been replaced by OPUS elsewhere in the industry. SPEEX, as the pun-of-a-name suggests, is aimed at voice data, whilst Vorbis is more tuned to music/general audio; the details are hidden away in the Receiver partner guides [PDF]. These are essentially server-“rendered” audio. There have been a few hints that this is an area of technology that generally isn’t as good as it will need to be; an obscure support paper from NVIDIA (back in 2016) indicated that if you were hoping to do VFX-type video editing you probably needed to look at this: “The remoting protocols and virtualization stacks also handle audio/graphical channel synchronization differently and users should consult their remoting vendor e.g. Citrix/VMware/other on their technologies to avoid drift over time.” A few have noticed the tendency for some protocols to drift, but it’s rare; recently, however, there are signs this may become of interest to the mainstream VDI user, as well as to the cutting-edge new video and VFX cases.
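To see why “drift over time” happens at all, consider a back-of-the-envelope example (illustrative numbers, not measurements from any product): the client’s sound hardware plays samples on its own crystal, which never ticks at exactly the server’s nominal rate.

```python
# Illustrative clock-drift arithmetic: if the client's audio clock
# actually runs at 48,005 Hz while the server generates samples at a
# nominal 48,000 Hz (a ~0.01% mismatch), skew accumulates steadily
# unless the protocol corrects for it.
nominal_hz, actual_hz = 48_000, 48_005
drift_per_hour_s = 3600 * (actual_hz - nominal_hz) / nominal_hz
print(f"{drift_per_hour_s * 1000:.0f} ms of A/V skew per hour")  # ~375 ms
```

Without periodic correction, that skew simply accumulates, which is why the NVIDIA guidance singles out drift over time rather than a fixed offset.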
A few weeks ago, veteran CTP Tobias Kreidl published a (very good—a must read!) guest blog on the Citrix site reviewing the Raspberry Pi as a thin client, in which he noted some setting tweaks he’d made to overcome audio/video synchronization issues. In particular, he noted: “The audio quality problem turned out to be fixed by a modification to the ‘timer-based scheduling’ parameter. This fixed not only the poor sound quality with the analogue jack, but also HDMI audio output quality issues. While the claim is that digital audio quality may suffer some with this setting, to me it was imperceptible.”
Now, Tobias was using the Raspberry Pi, which means he was using the Citrix Receiver client for Linux. Usually, some of the work required to synchronize audio and video takes place in the client, so OS-specific features like this may not be available in the Windows Receiver, etc. Tobias switched a setting (Audio Latency Control) under the hood that allows packets to be dropped to keep the audio in sync. It’s a fairly effective trick that Tobias found works well for most VDI users.
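Conceptually, the trick looks something like this minimal sketch (illustrative Python, not Citrix’s actual implementation): monitor how much audio is queued for playout, and when the backlog exceeds a latency budget, discard the oldest packets so the stream snaps back into sync.

```python
from collections import deque

class AudioPlayoutBuffer:
    """Illustrative latency control: drop queued audio when the playout
    backlog exceeds a budget, so audio doesn't drift behind video.
    (A hypothetical sketch, not Citrix's actual implementation.)"""

    def __init__(self, max_latency_ms=150, packet_duration_ms=10):
        self.max_latency_ms = max_latency_ms
        self.packet_duration_ms = packet_duration_ms
        self.queue = deque()

    def backlog_ms(self):
        # Latency contributed by packets still waiting to be played.
        return len(self.queue) * self.packet_duration_ms

    def push(self, packet):
        self.queue.append(packet)
        # If the backlog has crept past the budget, drop the oldest
        # packets: a brief glitch, but the stream is back in sync.
        while self.backlog_ms() > self.max_latency_ms:
            self.queue.popleft()

    def pop(self):
        # Called by the audio device callback at playback rate.
        return self.queue.popleft() if self.queue else None
```

The quality cost is exactly the one Tobias describes: the dropped packets are audible in principle, but for voice-grade VDI audio the glitch is usually imperceptible.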
“Audio Latency Control” is a very obscure parameter. The Linux Receiver is a product that OEM thin client manufacturers incorporate into their own products, and historically it was developed by the UK HDX performance and research team, which means it often led on features, performance, and OEM feature requests. The obscure OEM Linux Receiver partner guide [PDF] is pretty much the only place you’ll find much information on this or on SPEEX/Vorbis (particularly around pages 43-44). To the best of my knowledge, the Audio Latency Control feature has never propagated to Citrix’s other clients (e.g., for Windows), and there’s certainly a lack of definitive documentation.
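For flavor, a client-side knob like this would typically live in the Linux Receiver’s module.ini configuration file. The snippet below is purely hypothetical: the section and parameter names are my illustration of what such a setting might look like, and the partner guide is the authoritative source for the real ones.

```ini
; Hypothetical illustration only -- consult the OEM Linux Receiver
; partner guide (pages 43-44) for the actual parameter names/values.
[ClientAudio]
AudioLatencyControlEnabled=True   ; permit packet drops to hold audio in sync
AudioMaxLatency=150               ; illustrative latency budget in milliseconds
```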
Other protocols
There are a few ways audio-to-video synchronization can be achieved, and it’s been a field of intense interest outside VDI, particularly with video streaming services like Netflix. For pure video, there is a wider range of options, as there are usually a lot of tricks to be done with buffering. These, however, aren’t really an option with interactive VDI and graphics/gaming. Dropping packets with a (hopefully) imperceptible degradation, or timestamping (making sure the audio and graphics packets can be matched up), are options, but they come with quality or performance overheads.
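To illustrate the timestamping option, here’s a minimal sketch (illustrative Python; the sync window and all names are my assumptions, not any vendor’s actual protocol code): both streams carry a shared server clock, and the client compares the stamps at playout time.

```python
class AvSynchronizer:
    """Illustrative timestamp-based A/V sync: audio packets and video
    frames carry the same server clock; the client drops stale audio
    and briefly holds early audio. (A sketch, not real protocol code.)"""

    SYNC_WINDOW_MS = 40  # illustrative tolerance before correcting

    def __init__(self):
        self.last_video_ts_ms = None  # server clock of the last frame shown

    def on_video_frame(self, frame_ts_ms):
        self.last_video_ts_ms = frame_ts_ms

    def schedule_audio(self, packet_ts_ms):
        """Return 'drop', 'play', or a hold time in ms for an audio packet."""
        if self.last_video_ts_ms is None:
            return "play"              # no video yet: just play
        skew_ms = packet_ts_ms - self.last_video_ts_ms
        if skew_ms < -self.SYNC_WINDOW_MS:
            return "drop"              # audio lags too far: quality hit
        if skew_ms > self.SYNC_WINDOW_MS:
            return skew_ms             # audio leads: holding adds latency
        return "play"
```

The overheads mentioned above are visible right in the sketch: the drop branch costs quality, and the hold branch costs interactivity.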
Beyond Citrix, very few protocols seem to publish any information on their capabilities, although comparable high-end protocols such as Teradici’s do seem to have some capabilities for enhanced audio/video sync. But much like Citrix, they indicate there is a small impact on performance: “This feature introduces a small lag in user interaction responsiveness when enabled.”
After trawling and googling through vast reams of documentation, support articles, and conference presentations, I found nothing to indicate specific capabilities were available from other major protocols. Public information on the young VMware Blast Extreme protocol seems entirely focused on graphics.
With RDP, I certainly found a lot of folks experiencing audio latency, but these reports were mostly associated with more mundane problems than fundamental protocol limitations (e.g., broken drivers, misconfigured settings, incompatible headsets, etc.). Much as with graphics, the raw protocol often gets the finger for what turn out to be more basic issues.
The corner cases
There’s been an increase in interest in VFX-type use cases, and the graphics is now good enough, so much so that Jellyfish Pictures have moved a fair amount of work to the cloud and the Nimble Collective have launched their Cloud Animation Platform. I suspect, though, that the very extreme demands on audio and graphics, and also the synchronization needs, mean that high-end video and film editing is probably still best left on beefy workstations for now.
It’s time the community weighed in on audio synchronization
Over the last five years, as cloud GPUs from the likes of NVIDIA, AMD, and to some extent Intel have taken off, there’s been a great deal of investment and innovation from protocol vendors and product managers in graphics remoting, and also a huge amount of community scrutiny and input.
Tools like RDAnalyzer and GPU Profiler, as well as mainstream commercial monitoring and analysis products (Login VSI, Fraps, Goliath, ControlUp, etc.), have advanced greatly. There’s also a body of community experts discussing H.264 vs. RLE bitmap encoding and artifacts, who pop up at every EUC conference: the likes of Helge Klein, Barry Schiffer, Ben Jones, Ruben Spruijt, Benny Tritsch, Tobias Kreidl, Martin Rowan, Magnar Johnson, Bram Wolfs, Bas van Kaam, Rody Kossen… and countless others too numerous to name.
Yet we haven’t seen the same level of discussion or investigation of audio/speech codecs (e.g., SPEEX vs. OPUS vs. Vorbis). The ARM Cortex-A NEON extensions in the Raspberry Pi that Tobias was using have a specifications list that includes a wide range of audio codecs, but little discussion of their relative merits occurs, let alone quantification of their capabilities with respect to synchronization with graphics.
So, for 2019, I’d love to see those community groups kick the tires and look closer at how robust protocols are in this area, as there certainly isn’t much information available from the vendors themselves.