Behind the Curtain

The Science of Immersive Audio

This is an opportunity to dig deeper into how the human brain deals with sound arrivals and, with specific reference to TiMax SoundHub, how this can be both controlled and exploited for creative purposes.

The need for spatial sound systems

Human brains are designed to deal with complex audio landscapes.

Imagine a cocktail party with lots of people talking around you, someone talking to you, some background music, the chink of glasses – your brain can focus on certain sounds while filtering out others.

That’s because sounds coming from different directions are easier to differentiate and isolate. This is known as ‘Spatial Unmasking’ and is central to achieving separation, clarity and intelligibility in a mix.

But when all sounds come from just one source, like a loudspeaker, some sounds tend to block, or ‘mask’, certain others.

Also, within a conventional stereo left/right mix only a small part of the audience in the centre enjoys the full experience, while most are hearing a mix biased towards either the left or the right speakers.

A TiMax SoundHub spatial sound system, however, mimics real life, helping deliver truly localised sounds so your ears and brain can better decipher what is happening.

But to achieve such accurate localisation it is vital to control both level AND delay – to manage something called the Haas Precedence Effect.

The Haas Precedence Effect

The Haas Precedence Effect describes how the human brain localises not just to what is louder but also to what it hears first, even if it’s less than 1ms ‘early’. This results in ‘time panning’, which inherently draws the listener towards the preceding sound from the speaker nearest to them.

To put this in perspective, the distance between one audience member and their neighbour – roughly 2ft – corresponds to about 2ms of sound travel time, so if you don’t manage precedence everyone is hearing a spatially different show.
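As a rough sanity check on that figure, the arrival-time offset for a given path-length difference is simply distance divided by the speed of sound. Below is a minimal sketch of that conversion (assuming a nominal 343 m/s speed of sound; the function name is ours, not part of any TiMax API):

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # nominal speed of sound in air at ~20 degrees C

def path_delay_ms(distance_m: float) -> float:
    """Convert a path-length difference in metres to an arrival-time offset in ms."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0

# One seat width of roughly 2 ft (about 0.61 m) works out to ~1.8 ms,
# i.e. the "about 2 ms" figure quoted above.
print(round(path_delay_ms(0.61), 2))
```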

This could mean one person is instinctively looking left to see who is talking and another is looking right, when in fact it is someone in the middle of the stage who is speaking.

The Haas Effect must therefore first be CONTROLLED, to eliminate the inherent problems it can cause, then EXPLOITED, for creative purposes.  Pro-active application of Haas delays can make a mix easier to achieve by helping maintain real separation between sources.

Control it or it controls you

In order to achieve accurate localisation in spatial sound reinforcement, TiMax SoundHub spatial rendering applies multiple variable ‘over-delays’ to ensure everyone localises to the first wave-front arrivals from audio content sources – whilst automatically steering clear of certain perceptual boundaries, described below.

For example, if the over-delay is too short this can lead to comb-filtering, where some wavelengths cancel due to the very close first and secondary arrivals, resulting in varied colouration of the audio signals that can make things sound “tinny” or “phasey”. Amplified voices can be particularly susceptible, as we readily notice when familiar fundamentals and harmonics are absent.

Equally, if the over-delay is too long the listener is pushed past what is referred to as the Echo Perception Threshold, where a noticeable duplicate second arrival is heard as an echo, or multiple echoes blend into an undesirable and distracting ambience.

But there’s more – too loud a secondary arrival, i.e. the amplified sound, can overpower the Haas Precedence cues being used to achieve on-stage localisation. TiMax SoundHub therefore applies adaptive, variable level-shading which automatically compensates for the positions of audience members and speaker systems relative to the sources on stage being spatially reinforced.
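To make the level-shading idea concrete, here is a deliberately simplified sketch (our own illustration under an inverse-square propagation assumption, not the adaptive algorithm TiMax SoundHub actually uses): a speaker close to a listener but far from the on-stage source would naturally arrive much louder, so its feed is trimmed until the amplified arrival sits only a few dB above the direct sound.

```python
import math

def level_shade_db(source_to_listener_m: float,
                   speaker_to_listener_m: float,
                   headroom_db: float = 6.0) -> float:
    """Illustrative level-shading trim in dB (negative = attenuate).

    Under a simple inverse-square (6 dB per doubling of distance) model,
    the nearby speaker path is louder than the direct path by
    20*log10(d_source / d_speaker) dB; trim it so the amplified arrival
    stays no more than `headroom_db` above the direct sound.
    """
    path_advantage_db = 20.0 * math.log10(source_to_listener_m / speaker_to_listener_m)
    return -max(0.0, path_advantage_db - headroom_db)

# Listener 20 m from the stage source but only 4 m from a delay speaker:
# the speaker path is ~14 dB "hotter", so trim roughly 8 dB with 6 dB of headroom.
print(round(level_shade_db(20.0, 4.0), 1))
```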

Precedence Effect Parameters summary

To summarise, the numbers associated with the Haas Precedence Effect are quite simple (i.e. 0-10ms Comb-Filtering, 10-25ms Useful Imaging Effect, above 25ms Echo-Perception).
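Those rule-of-thumb windows are easy to encode; the short sketch below simply classifies an over-delay against them (the thresholds are the figures quoted above, the function itself is our own illustration):

```python
def precedence_window(over_delay_ms: float) -> str:
    """Classify an over-delay against the rule-of-thumb Haas windows above."""
    if over_delay_ms < 10.0:
        return "comb-filtering risk"
    if over_delay_ms <= 25.0:
        return "useful imaging effect"
    return "echo perception"

for d_ms in (4.0, 15.0, 40.0):
    print(d_ms, "ms ->", precedence_window(d_ms))
```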

Source-oriented reinforcement

With properly applied ‘Haas Effect Imaging’ comes improved localisation of the original acoustic sources and little or no awareness of the amplification. This is also referred to as source-oriented reinforcement (SOR),  effectively giving each performer, sound source or playback track their own PA system in terms of timing and localisation. 

This helps the cognitive senses to both separate audio sources into a more natural panorama and also integrate visual and audio events so that they support each other instead of conflicting, which causes distraction and subtle stress in an audience. 

An audience member will ultimately find it distracting to see an actor on the right side of the stage but hear their voice from a speaker on the left or high above them. Multiply this disconnect across a number of performers and you will have an audience subconsciously stressed by the effort of trying to discern who’s saying what. The willing suspension of disbelief is therefore challenged and the performance becomes less engaging.

Dynamic delay-matrix with stagetracking

TiMax SoundHub’s unique dynamic delay matrix allows multiple simultaneous time alignments between each mic or sound source and every speaker, strategically managing precedence and hence localisation in real-time. With visual and audio stimuli in sync, the entire audience can be immersed in believable drama or a rewarding musical performance.
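Conceptually, such a delay matrix can be derived from geometry: each speaker feed for a given source is delayed by the time the natural wavefront would take to travel from that source’s stage position to the speaker, plus a small over-delay inside the useful imaging window, so listeners localise to the real position rather than the loudspeaker. The sketch below is our own simplified illustration of that idea, not the SoundHub rendering engine (which also applies level-shading and other adaptive parameters):

```python
import math

SPEED_OF_SOUND_M_PER_S = 343.0

def delay_matrix_ms(sources: dict, speakers: dict, over_delay_ms: float = 15.0) -> dict:
    """Return a (source, speaker) -> delay-in-ms mapping.

    Each speaker feed is delayed by the acoustic travel time from the
    source position to the speaker position, plus an over-delay chosen
    inside the useful 10-25 ms imaging window.
    """
    matrix = {}
    for src_name, src_pos in sources.items():
        for spk_name, spk_pos in speakers.items():
            travel_ms = math.dist(src_pos, spk_pos) / SPEED_OF_SOUND_M_PER_S * 1000.0
            matrix[(src_name, spk_name)] = travel_ms + over_delay_ms
    return matrix

# A vocalist stage-left against a simple LCR system (positions in metres).
sources = {"vocal_SL": (-4.0, 0.0)}
speakers = {"L": (-6.0, 2.0), "C": (0.0, 2.0), "R": (6.0, 2.0)}
for pair, delay in delay_matrix_ms(sources, speakers).items():
    print(pair, round(delay, 1), "ms")
```

With stagetracking, the source positions would simply be updated as the performer moves and the matrix recomputed continuously.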

The addition of performer stagetracking automates this as the actors move around the stage, leaving the sound engineer free to concentrate on the subtleties and dynamics of the mix. This is the origin of the TiMax mantra of Hear the Sound, Not the System.

A further huge benefit of accurate localisation is clarity and impact through improved separation between the different sound sources in a spatial mix.  For amplified acoustic content such as voices and orchestra, TiMax SoundHub creates reinforcement which is sympathetic to the performance, rather than overpowering it as with more conventional approaches to PA.

So the ultimate objective is to draw the audience into a performance, and to prevent the technology from distracting or clashing with it.


Immersive Sound FAQs

Immersive audio describes an outcome, objective or aspiration achieved by spatial audio content and processes that result in an engaging and immersive experience for the listener, either on its own or in conjunction with other media such as light, video, scenography, odours.

“Immersive audio” is however often used as a convenient way to help relative audio laypersons differentiate between multichannel systems that do spatialisation and more conventional systems that just do stereo or mono.

Beware false prophets: “Immersive” cannot properly be used to describe a device, machine, loudspeaker, screen, light fixture, curtain etc, but often is. Full disclosure – we’re sometimes guilty of it ourselves, where it offers the convenience mentioned above when opening discussions about TiMax. But the trick is to always ascertain exactly what is meant, desired and/or aspired to when the subject of ‘Going Immersive’ comes up.

Spatial audio refers to audio technology that creates a three-dimensional sound environment, simulating the perception of sound coming from various directions, distances, and heights. It aims to replicate the way humans naturally perceive sound in the real world, enhancing immersion and realism in audio experiences. Spatial audio can range from an authentic or accurate representation of a real acoustic panorama, to an enhanced or synthetic soundfield aimed at achieving a more enveloping or “cinematic” creative spatial outcome.

Spatial unmasking is a phenomenon in auditory perception where the ability to detect and understand a sound is improved when it is spatially separated from competing sounds.

In other words, when a target sound is presented from a different direction or location compared to interfering sounds, it becomes easier for the listener to discern and focus on the target sound. This effect is often utilised in spatial audio processing to enhance the intelligibility of desired sounds in environments with background noise or multiple sound sources.

By spatially separating the target sound from interfering sounds, spatial unmasking can improve the overall clarity and perceptual salience of the desired auditory information, leading to a more effective communication and listening experience.

The Haas Precedence Effect, also known as the law of the first wavefront, describes a psychoacoustic phenomenon where the human auditory system perceives a fused sound image when presented with two identical sounds from different directions, and localises to the earliest one provided that the delay between the two sounds is short enough. In such cases, the perceived location of the sound is primarily determined by the direction of the first arriving sound, known as the “lead” or “preceding” sound, while the delayed sound provides the localised amplification, or added loudness.

In immersive spatial audio, the Haas Precedence Effect plays a crucial role in creating a realistic sense of space and directionality. By manipulating the timing and amplitude of audio signals delivered to different channels, audio engineers can simulate the natural panorama of a performance and accurately position sound sources within a three-dimensional space. This technique allows for the creation of lifelike auditory experiences where sounds appear to originate from specific locations around the listener, enhancing immersion and realism.

Source-oriented reinforcement (SOR) is an approach in audio engineering and sound reinforcement where the amplification and processing of audio signals are tailored specifically to localise individual sound sources or sources of interest within a given audio environment. Unlike traditional methods that focus primarily on uniformly reinforcing sound across the entire listening area, source-oriented reinforcement aims to provide localised amplification and processing to optimise the clarity, intelligibility, and impact of specific sound sources.

In source-oriented reinforcement, audio engineers strategically link microphones and speaker systems via delay-matrix signal processing to prioritise localisation of key sound sources, thereby enhancing the perceived clarity and intelligibility of performers on stage, presenters in a conference room or instruments in an orchestra or band. By delivering spatial reinforcement of these critical elements, source-oriented reinforcement helps to ensure that the audience receives the most important auditory information with clarity and precision, even in challenging acoustic environments or amidst competing background noise. This ultimately leads to more engaging and immersive auditory experiences for listeners.

Object-based audio is an audio technology that enables content creators to produce immersive audio experiences by representing sound as discrete audio objects rather than traditional channels. In object-based audio systems, each sound element in a scene—whether it’s dialogue, music, effects, or ambience—is treated as an individual object with its own metadata, including positional information, volume, and other characteristics.

Unlike traditional channel-based audio formats like stereo or surround sound, which require fixed speaker configurations and pre-mixed audio channels, object-based audio provides flexibility and adaptability in sound reproduction. Audio objects can be dynamically positioned within a three-dimensional sound space, allowing for precise localisation and movement of sound sources relative to the listener’s position. TiMax SoundHub is an example of such an object-based audio system.
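As an illustration of what “an individual object with its own metadata” can look like, here is a generic sketch (our own, not TiMax’s or any other platform’s actual object schema):

```python
from dataclasses import dataclass, field

@dataclass
class AudioObject:
    """Generic object-based audio element: a sound plus the positional
    and rendering metadata carried alongside it."""
    name: str
    position: tuple          # (x, y, z) in metres relative to a reference point
    gain_db: float = 0.0
    size: float = 0.0        # 0 = point source; larger values = more diffuse
    metadata: dict = field(default_factory=dict)

lead_vocal = AudioObject(name="lead_vocal",
                         position=(0.0, 3.0, 1.7),
                         metadata={"input": 12, "priority": "dialogue"})
print(lead_vocal)
```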

The Image Definition is a particular type of audio object unique to TiMax, as opposed to the familiar coloured blobs we all know that represent input sources such as audio tracks, mics, instruments etc in the likes of Atmos and other spatial platforms, including TiMax.

TiMax lets you scatter Image Definition spatial objects around a space, wherever you want the sound to come from – on stage, in surround, overhead, or in special imaging layers for things like reverbs and effects.

At the push of a button TiMax then pre-renders these Image Definitions to apply the correct delay and level parameters to the speakers to make the aforementioned coloured input blobs come from the right places when you place them on or in between them. In simple terms Image Definitions can be thought of as just elaborate output routing objects chosen by you to achieve the spatialisation objectives you want, and which make sure these work for the whole audience.
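One way to picture this (our own simplified mental model, not the SoundHub implementation) is that each Image Definition is a named set of pre-rendered per-speaker delay and level offsets, and placing an input between two of them simply blends those offsets:

```python
def blend_image_definitions(image_a: dict, image_b: dict, position: float) -> dict:
    """Crude linear blend between two pre-rendered Image Definitions.

    Each image maps speaker name -> (delay_ms, level_db); `position` runs
    from 0.0 (entirely image_a) to 1.0 (entirely image_b).
    """
    blended = {}
    for spk in image_a:
        delay_a, level_a = image_a[spk]
        delay_b, level_b = image_b[spk]
        blended[spk] = (delay_a + (delay_b - delay_a) * position,
                        level_a + (level_b - level_a) * position)
    return blended

# Two hypothetical downstage Image Definitions rendered for an LCR rig.
downstage_left  = {"L": (12.0, 0.0), "C": (18.0, -3.0), "R": (26.0, -9.0)}
downstage_right = {"L": (26.0, -9.0), "C": (18.0, -3.0), "R": (12.0, 0.0)}
print(blend_image_definitions(downstage_left, downstage_right, 0.25))
```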

TiMax can achieve effective localisation and spatialisation with often surprisingly minimal and flexible loudspeaker system topologies. While some other spatial platforms use algorithms that can require quite rigidly defined numbers of speakers, angles and spacings, the flexibility of TiMax Image Definitions means it can create convincing vocal imaging for instance with a simple LCR system, even on a 15-20m wide/deep stage.

Furthermore, where stages and audience areas are challengingly asymmetrical, TiMax can spatially render these with ease. There are often venues, such as for corporate or experiential events, where loudspeakers in non-ideal locations also have to be hidden from view, or the audience area requires coverage from multiple directions. TiMax can handle this no problem, as long as there are at least some speakers pointing from roughly the direction the sound needs to be localised. If in doubt – ask us – we’ve done this stuff before.

As far as the physics is concerned, sound can indeed be localised in height, because the speed of sound is obviously the same in all directions, so it can be used to localise in the z-plane as much as the x and y. In Disney’s Aladdin worldwide franchise a key moment was the flying carpet where you couldn’t see the wires, and TiMax with Tracker were used to make Aladdin and the Princess’s voices come from up in the air.

But the geometry and physiology of our ears sometimes make it harder for us to hear “overhead”, due to the shape of the pinna being idealised for horizontal localisation. So with the addition of movement, as in Aladdin, the variable filtering effect actually tells the brain the sound is overhead, effectively by simply changing its EQ.

There is a volume level limit above which Haas Precedence stops working and sheer power takes over the imaging. Also, high-frequency content such as sibilance tends not to delay-image as well; effective Haas Precedence is more of a midrange thing, based around the sort of bandwidth occupied by voices and the like.

But we have some tricks to deal with this. Firstly, rolling-off the top-end of vocal mics and/or a dedicated vocal system by 2-3dB lets the engineer get more level on the fader while still getting good localisation.

Where there’s a loud band in the pit, or it’s a large outdoor stage, we add “anchor” speakers upstage or in the set to amplify the “time zero” of the voice, i.e. making it louder coming off stage, so you can again get a bit more level on the fader and typically achieve 8-10dB of amplification without losing vocal imaging.

If you want to impress people with a bit of jargon, this technique is sometimes rather grandly known as First Wavefront Reinforcement, and it does work remarkably well.