Over the past year high-resolution audio has become more and more popular. The logic behind it is that the additional bits provide more data, lower noise, and more dynamic range, but is this really the case?
Several chips on the market have specifications that seem to say so, with resolutions in the 32- and 64-bit ranges and sampling frequencies upwards of several hundred kHz or even MHz.
The truth is that it's just not possible with modern technology in real-world scenarios to achieve these resolutions even for 24-bit audio. Let's see why.
First let's talk about sampling theory.
The process of sampling is to obtain signal values from a continuous signal at regular time intervals. Sampling rate is the equivalent of one over the sampling interval. The result of sampling is a sequence of numbers. At this point it's still not a digital signal because those numbers can be any number from a continuous range. Sampling extracts the signal value at all integer multiples of the sampling interval.
The opposite step is called interpolation, wherein we take a sampled signal and reconstruct a continuous signal from it. To reconstruct a continuous signal from the samples we must "guess" what the value of the signal could be between the samples. This is an important concept for understanding why higher sampling rates are better. A common and simple type is linear interpolation, where we effectively draw a straight line between adjacent samples and read the value off that line. This creates a continuous-time signal, in theory.
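To make the "guessing between samples" idea concrete, here is a minimal sketch that samples a 1 Hz sine wave and rebuilds it with straight-line interpolation. The function names, the test tone, and the two sampling rates (8 Hz and 64 Hz) are illustrative assumptions of mine, not anything from a real DAC.

```python
import math

def sample(signal, rate_hz, duration_s):
    """Return signal values at integer multiples of the sampling interval."""
    interval = 1.0 / rate_hz
    count = int(round(duration_s * rate_hz))
    return [signal(n * interval) for n in range(count + 1)]

def linear_interpolate(samples, rate_hz, t):
    """Estimate the signal at time t from a straight line between neighbours."""
    position = t * rate_hz
    left = min(int(math.floor(position)), len(samples) - 2)
    fraction = position - left
    return samples[left] + fraction * (samples[left + 1] - samples[left])

def tone(t):
    """Hypothetical 1 Hz test tone."""
    return math.sin(2 * math.pi * 1.0 * t)

def worst_error(rate_hz):
    """Largest gap between the interpolated guess and the true tone over 1 s."""
    samples = sample(tone, rate_hz, 1.0)
    times = [k / 1000.0 for k in range(1000)]
    return max(abs(linear_interpolate(samples, rate_hz, t) - tone(t)) for t in times)

coarse_error = worst_error(8)   # 8 samples per cycle
fine_error = worst_error(64)    # 64 samples per cycle
print(coarse_error, fine_error)
```

With this particular kind of interpolation, the worst-case guess gets much closer to the real waveform as the sampling rate goes up.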
The Nyquist sampling theorem states that for a continuous signal to be reconstructed perfectly, the highest frequency to be reconstructed must be less than half the sampling frequency. This is strictly less than, not less than or equal to: if the sampling frequency is exactly double, the reconstruction may or may not be perfect. It only works if the sample instants happen to coincide with the maxima of the sinusoid; when the sample instants coincide with the zero-crossings you capture nothing, and anywhere in between you capture the sine wave with the wrong amplitude. Frequencies above that limit cause aliasing. At its simplest, if you sample a frequency of 8 Hz at a sampling rate of 10 Hz, the samples fit a 2 Hz sinusoid just as well as the 8 Hz one, so the two cannot be told apart. When we are speaking in terms of audio reproduction, we want only the 8 Hz and not information that doesn't exist on the recording.
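The 8 Hz and 10 Hz example is easy to check numerically. The sketch below samples an 8 Hz cosine and a 2 Hz cosine at 10 Hz and compares the two sample sequences; the names and the choice of 50 samples are my own for illustration.

```python
import math

SAMPLE_RATE_HZ = 10.0

def sample_cosine(freq_hz, count):
    """Sample a unit-amplitude cosine of freq_hz at SAMPLE_RATE_HZ."""
    return [math.cos(2 * math.pi * freq_hz * n / SAMPLE_RATE_HZ) for n in range(count)]

samples_8hz = sample_cosine(8.0, 50)
samples_2hz = sample_cosine(2.0, 50)

# If the two tones alias onto each other, the sample values match.
largest_difference = max(abs(a - b) for a, b in zip(samples_8hz, samples_2hz))
print(largest_difference)
```

The difference comes out at the level of floating-point rounding, which is the point: once sampled at 10 Hz, nothing in the numbers distinguishes the 8 Hz tone from a 2 Hz one.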
Let us make it more complicated still. Ultra-high-frequency noise intrudes into our signal. When a frequency present is higher than half the sampling frequency it causes what is called foldover. Foldover spectrally shifts this noise into the baseband (or audible spectrum), meaning HF noise gets folded into the digital signal that we want to capture. Furthermore, these folded frequencies will not have any harmonic relationship to our signal. That means even if the noise is well above our filter range, it gets folded back into the signal and causes amplitude distortion. This is even more annoying than harmonic distortion (which really isn't that big of a deal in digital audio).
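Where a given out-of-band tone lands after foldover can be worked out with a couple of lines of arithmetic. This is a small sketch for an ideal sampler with no anti-aliasing filter in front of it; `folded_frequency` is a name I made up for the example.

```python
def folded_frequency(freq_hz, sample_rate_hz):
    """Frequency (0 to sample_rate/2) where a tone appears after sampling."""
    remainder = freq_hz % sample_rate_hz
    return min(remainder, sample_rate_hz - remainder)

print(folded_frequency(8, 10))             # the 8 Hz at 10 Hz example
print(folded_frequency(30_000, 44_100))    # ultrasonic tone, CD sample rate
print(folded_frequency(100_000, 44_100))   # noise well above the audio band
print(folded_frequency(1_000, 44_100))     # in-band tone is left alone
```

Note that the folded results (14,100 Hz and 11,800 Hz for the two ultrasonic cases) have no musical relationship to the original frequencies.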
After we have sampled the signal we have a string of numbers that can still take on any value in a continuous range. This means there's an infinite number of possible values for each number and an infinite number of digits. We don't have that many at our disposal. So once we have made the time variable discrete, we then have to make the amplitude variable discrete as well. This process of discretizing the amplitude is called quantization.
Let's assume we have to represent values between -1 and +1 and each of these values is limited to one decimal digit. This would give us -1, -0.9, -0.8, ..., 0.1, 0.2, and so on. These numbers are arbitrary, for example purposes. Each of these quantization levels covers a range of input values whose width is the quantization step. For instance, anything between -0.05 and +0.05 would become a 0.0 value. This leaves room for error because the stored values are never exact, but the error is restricted to a small range: at most half a step. These errors are known as quantization noise. The complexity of quantization is beyond the scope of this article, so I won't dive too much deeper.
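Here is the same toy quantizer as a short sketch, using the -1 to +1 range and the 0.1 spacing between levels from the example. The function name and the sweep of test values are assumptions for illustration only.

```python
def quantize(value, step=0.1, low=-1.0, high=1.0):
    """Snap a value to the nearest level on a grid of the given step size."""
    clipped = max(low, min(high, value))
    index = round(clipped / step)   # nearest level number
    return index * step

# Sweep the whole range and record the largest quantization error.
test_values = [k / 1000.0 for k in range(-1000, 1001)]
worst_error = max(abs(quantize(v) - v) for v in test_values)

print(quantize(0.04))   # lands on the 0.0 level
print(quantize(0.33))   # lands on the 0.3 level (to within float rounding)
print(worst_error)      # never more than half a step
```

The worst-case error stays at half of the 0.1 step, which is the "small range" the quantization noise is confined to.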
For every bit, the signal-to-noise ratio increases by about 6 dB. So, in theory, at 24 bits we can have a signal-to-noise ratio of roughly 146 dB. However, to achieve this figure we would have to take a sine wave at full amplitude, which is a higher level than any audio system will ever see. For each 6 dB of headroom we relinquish one bit of resolution.
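The 6 dB per bit rule comes from the textbook expression for an ideal quantizer driven by a full-scale sine wave, usually written as SNR = 6.02N + 1.76 dB. The sketch below evaluates it and then subtracts headroom; the function names and the 12 dB headroom figure are my own picks for the example.

```python
import math

def ideal_snr_db(bits):
    """Ideal quantizer SNR for a full-scale sine: 20*log10(2^N) + 10*log10(1.5)."""
    return 20 * math.log10(2 ** bits) + 10 * math.log10(1.5)

def snr_with_headroom_db(bits, headroom_db):
    """SNR left over when the signal sits headroom_db below full scale."""
    return ideal_snr_db(bits) - headroom_db

print(round(ideal_snr_db(16), 1))               # about 98 dB
print(round(ideal_snr_db(24), 1))               # about 146 dB
print(round(snr_with_headroom_db(24, 12), 1))   # 12 dB of headroom costs about 2 bits
```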
Basically what we're talking about is amplitude distortion caused by noise. But let's step back and examine why high resolution may not be possible with modern technologies.
The heat tolerances required for high-resolution audio are extremely high. Heat generates noise, and the more data that needs to be processed in the same amount of time the more heat that is generated. The more noise, the less possible resolution.
But heat isn't, of course, the only source of noise. We have power supply noise, timing errors, EMI generated by the components themselves, RF noise, and ultra high frequency noise. All of these types of noise create foldover, aliasing, and amplitude distortion.
But how little noise is this really? To achieve 16-bit audio accurately, you must have a signal-to-noise ratio of 96 dB, which at typical line-level output voltages means noise on the order of tens of microvolts rms. To achieve 24-bit accuracy you actually need a signal-to-noise ratio of 144 dB, or noise on the order of 100 nVrms. And just for shiggles... 64-bit resolution requires a 384 dB S/N ratio.
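Turning a signal-to-noise ratio into an actual noise voltage needs a full-scale reference level, which I haven't stated above. The sketch below assumes a 2 Vrms full-scale output purely for illustration; the absolute numbers scale directly with whatever level your system really uses, so treat them as order-of-magnitude figures.

```python
def noise_floor_vrms(snr_db, full_scale_vrms=2.0):
    """Noise voltage that sits snr_db below an ASSUMED full-scale level."""
    return full_scale_vrms / (10 ** (snr_db / 20.0))

for bits in (16, 24):
    snr_db = 6 * bits                        # the 6 dB per bit rule of thumb
    microvolts = noise_floor_vrms(snr_db) * 1e6
    print(bits, snr_db, round(microvolts, 3), "uVrms")
```

Under that assumption the 16-bit case allows a few tens of microvolts of noise, and the 24-bit case allows only a little over a tenth of a microvolt.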
Even the very best DAC chips on the market don't spec out above a 127 dB S/N ratio. And that's just the DAC chip; it doesn't factor in the S/N ratio of your entire system. If the noise floor of your system isn't well below the 16-bit margin you aren't hearing more than 16-bit resolution, and oftentimes much worse.
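You can run the bits-to-decibels relationship backwards to see how many bits a given S/N figure is really worth. This sketch uses the standard effective-number-of-bits relation, (SNR - 1.76) / 6.02, which is slightly stricter than the plain 6 dB per bit rule; the function name is mine.

```python
def effective_bits(snr_db):
    """Bits of resolution an ideal converter would need to reach snr_db."""
    return (snr_db - 1.76) / 6.02

print(round(effective_bits(127), 1))   # the best DAC chip spec mentioned above
print(round(effective_bits(96), 1))    # the 16-bit target
```

By this measure a 127 dB chip delivers a little under 21 bits before the rest of the system adds any noise of its own.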
That's just from a system limitation standpoint. Let's talk about the scale of data that needs to be processed. Because we're not just talking about a single bit; we're talking about 24 bits per sample, with tens of thousands of samples every second.
Bit Depth refers to the number of bits you have to capture an audio signal. This can be envisioned as a series of levels that are sliced in a given moment of time. For 16-bit audio you have 65,536 possible levels or amplitudes at any given moment of time. With every additional bit of resolution the number of different possible amplitudes doubles. By the time we reach 24-bit we have 16,777,216 levels. Remember, we are talking about a slice of audio in a single moment of time. We haven't even added the fact that all of this data needs to be processed thousands of times per second. Scared yet? Let's keep going.
Now let's add time to the equation. Sample rate, as you now know, is the number of slices per second. So for a Red Book CD you have 44,100 slices every second, per channel. Each slice has to land on one of 65,536 levels (or one of nearly 17 million at 24-bit), which for 16-bit stereo works out to about 1.4 million bits every second. If that sounds like a lot of data... it is.
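The level counts and raw data rates are simple arithmetic, sketched below. The helper names are made up for the example, and the 24-bit/192 kHz case is just a commonly sold high-res format used for comparison.

```python
def levels(bits):
    """Number of distinct amplitude levels at a given bit depth."""
    return 2 ** bits

def pcm_bits_per_second(sample_rate_hz, bits, channels):
    """Raw PCM data rate, ignoring any container or transport overhead."""
    return sample_rate_hz * bits * channels

print(levels(16))                             # 65536
print(levels(24))                             # 16777216
print(pcm_bits_per_second(44_100, 16, 2))     # Red Book CD, stereo
print(pcm_bits_per_second(192_000, 24, 2))    # 24-bit/192 kHz, stereo
```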
So let's say it is possible for our equipment to handle this amount of data in the proper amount of time. Where's the real issue?
Digital may be ones and zeros, but it's represented on a square analog waveform. This square waveform is still susceptible to extraneous voltages and noise.
All noise carries a small amount of voltage. This voltage then adds to or subtracts from the voltage applied to the digital signal. This may not seem like much, but when you have 65,000+ levels to contend with, a small amount of amplitude distortion caused by noise on every level adds up in a hurry. This amplitude distortion is precisely why many digital systems sound "digital". The noise in the power supply and in the circuit dramatically limits its ability to reproduce music. Remember quantization error? This is where it plays a huge role.
For a while I did some work on a discrete R2R DAC in an effort to create a device that could truly reproduce accurate 24-bit audio. The tightest-tolerance resistors available were made by Vishay at 0.005%, which is a VERY tight tolerance and not cheap. Even with an FPGA controlling and balancing the tolerances directly, the very subtle differences in resistance between the resistors created non-linearities and noise. The best I was able to achieve was 20-bit, even with a -143 dB noise floor.
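To get a feel for why resistor tolerance bites so hard, here is a rough first-order sketch. It assumes the tolerance shows up directly as an error in the weight of the most significant bit, and it ignores any trimming or FPGA correction, so it is a simplification and not a model of the DAC I actually built. Both function names are hypothetical.

```python
import math

def msb_error_in_lsb(tolerance, bits):
    """Error at the MSB transition, in LSBs, if the MSB weight is off by `tolerance`."""
    return tolerance * 2 ** (bits - 1)

def bits_supported(tolerance):
    """Largest bit depth at which that error stays within half an LSB."""
    return math.floor(math.log2(1.0 / tolerance))

tolerance = 0.005 / 100   # 0.005% expressed as a fraction
print(round(msb_error_in_lsb(tolerance, 24), 1))   # error in LSBs at 24-bit
print(bits_supported(tolerance))                   # uncorrected bit depth
```

Under this simplified model an uncorrected 0.005% part is good for only about 14 bits, and at 24 bits the same error is worth hundreds of LSBs, which is why active correction is needed to get anywhere near 20.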
So why does high-res sound better than its low-resolution counterpart? Well, it only does in some circumstances. You have three camps: one where it sounds better, one where it sounds worse, and one where there's no difference. Why is this? Think of it in terms of ratios. Let's say we have two pies, one with 24 pieces and one with 16 pieces. If someone sticks a finger in the pie and ruins it (distortion), the amount of information that's distorted in a 24-bit pie is proportionately smaller than in a 16-bit pie. BUT – of course there's a but – there's more to it than that.
When you have a higher noise floor you lose the additional bits that fall below the noise threshold. So if you have 24 bits, but your noise floor is only low enough for 16 bits, you lose AT LEAST the 8-bit difference and often more. It's like packing a snowball: your compressed snowball always has less mass than the initial clump of snow you started with. The excess energy is converted to heat or other forms of energy and is lost to the digital stream.
So as your system resolution improves you will notice a few things. You will begin to hear that compression and "loss" from 24-bit files. When you have no distortion on the pie, a 16-bit file will sound better because it's easier for the system to process and will still be a full pie. But even if a 24-bit file doesn't have distortion, you still have to deal with the compression caused by the lack of part tolerance and speed.
OK, so for argument's sake, let's say 24-bit audio is impossible with modern technology. So where is the future of audio, in my opinion?
Higher sampling frequency, not higher bit depth. A higher sampling frequency allows a more precise picture of the levels that are available. This allows for more levels to be portrayed more accurately.
What does this mean? DSD (Direct Stream Digital) and 1-bit audio. Fewer levels translate to less potential amplitude distortion. And with the extreme sampling frequencies (64x that of a standard CD, sometimes 128x, at up to a 5.6 MHz sampling frequency) we are able to reduce the distortion caused by the inaccuracy of quantization (quantization error).
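The DSD rates follow directly from the CD sample rate, as this short sketch shows. The function name is my own; since each DSD sample is a single bit, the sample rate is also the bit rate per channel.

```python
CD_SAMPLE_RATE_HZ = 44_100

def dsd_rate_hz(multiple):
    """1-bit sample rate for a DSD stream at `multiple` times the CD rate."""
    return CD_SAMPLE_RATE_HZ * multiple

print(dsd_rate_hz(64))     # 64x the CD rate
print(dsd_rate_hz(128))    # 128x the CD rate
print(192_000 * 24)        # for comparison: bits/s per channel of 24-bit/192 kHz PCM
```

So 64x works out to about 2.8 MHz and 128x to about 5.6 MHz, in the same ballpark of raw data as high-rate 24-bit PCM but spent entirely on time resolution.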
Of course this is still limited by the power supply and implementation of such technologies. But the simplicity allows for much greater potential results and less mathematical complexity.
DSD is not mainstream, but with the use of FPGAs and even video chips we are able to take even a PCM signal and convert it to DSD for potentially more accurate processing. There have been mixed results with this approach, but of course it's all about the implementation, not about the limitations of the conversion itself.
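For a rough idea of how a multi-bit value can be carried by a 1-bit stream, here is a minimal first-order delta-sigma modulator. It is a teaching sketch only: real PCM-to-DSD converters upsample first and use much higher-order modulators, and the input here is assumed to already be at the 1-bit rate. All names are hypothetical.

```python
def pcm_to_one_bit(samples):
    """First-order delta-sigma modulator: values in [-1, 1] -> list of 0/1 bits."""
    integrator = 0.0
    feedback = 0.0
    bits = []
    for value in samples:
        integrator += value - feedback        # accumulate the running error
        bit = 1 if integrator >= 0.0 else 0   # the 1-bit quantizer
        feedback = 1.0 if bit else -1.0       # output fed back as +1 or -1
        bits.append(bit)
    return bits

def average_level(bits):
    """Crude low-pass filter: the mean of the +1/-1 stream."""
    return sum(1.0 if b else -1.0 for b in bits) / len(bits)

constant_input = [0.5] * 10_000      # a steady level of 0.5
stream = pcm_to_one_bit(constant_input)
recovered = average_level(stream)
print(stream[:8], recovered)
```

The density of ones in the stream carries the level: averaging the bits recovers a value very close to the 0.5 that went in, even though each individual sample is only ever a one or a zero.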
Over the next year or two you will see more and more DSD-capable systems taking advantage of the 1-bit downloads available on the Internet.
So you can spend your time playing around with high-resolution 24-bit audio, or you can save your time by optimizing your digital front end to sound great even at low resolutions. On Core Audio Technology gear even streaming audio from Spotify sounds absolutely incredible. Of course uncompressed and lossless audio will sound marginally better than lossy formats, but that's not the argument I'm making here. I'm sure in years to come we'll have the noise tolerances and speed available to us for processing higher bit depth audio, but right now it's pretty much out of reach.