July 2016
2.4 Hypotheses and disputed results
Several studies intentionally considered discrimination of a high resolution format even when the content was not intended to be high resolution. In [62, 64], it was claimed that Nishiguchi 2003 did not have sufficient high frequency content. In one condition of Woszcyk 2007, a 20 kHz cut-off filter was used, and in Nishiguchi 2005, the authors stated that they 'used ordinary professional recording microphones and did not intend to extend the frequency range intentionally during the recording sessions... sound stimuli were originally recorded using conventional recording microphones.' These studies were still considered in the meta-analysis of Section 3, since further investigation (e.g., spectrograms and frequency response curves in [58, 64, 68]) shows that they may still have contained high frequency content, and the extent to which one can discriminate a high sample rate format without high frequency content is itself a valid question. Other studies noted conditions which may contribute to high resolution audio discrimination. [25, 60, 61] noted that intermodulation distortion may result in aliasing of high frequency content, and [63] remarked on the audibility of the noise floor for 16 bit formats at high listening levels. [23] had participants blindfolded in order to eliminate visual distractions, and [56], though finding a null result when comparing two high resolution formats, still noted that the strongest results were amongst participants who conducted the test with headphones. Together, the observations in this section identify potential biases or flaws to be assessed for each study, and a set of hypotheses to be validated, where possible, in the following meta-analysis section.
2.5 Risk of bias
3. Meta-analysis results
3.1 Binomial tests
Of note, several experiments in which the authors concluded that there was no statistically significant effect (Plenge 1980, Nishiguchi 2003) still appear to suggest that the null hypothesis can be rejected.
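The binomial test underlying such judgements can be sketched in a few lines. The trial counts below are hypothetical, chosen only to illustrate how a modest success rate over many trials can still reject the null hypothesis of random guessing (p = 0.5):

```python
from math import comb

def binomial_p_one_sided(k: int, n: int, p: float = 0.5) -> float:
    """One-sided p-value: probability of observing k or more
    successes in n trials under the null hypothesis rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical example: 63 correct answers out of 100 forced-choice
# trials, with 50% expected under random guessing.
p_value = binomial_p_one_sided(63, 100)
print(f"{p_value:.4f}")  # well below 0.05, so random guessing is rejected
```

The exact test is used rather than a normal approximation, since several of the studies surveyed involve small trial counts.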
3.2 To what extent does training affect results?
The statistic I² measures the extent of inconsistency among the studies' results, and is interpreted as approximately the proportion of total variation in study estimates that is due to heterogeneity (differences in study design) rather than sampling error. Similarly, a low p value for heterogeneity suggests that the tests differ significantly, which may be due to bias.
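The I² values quoted below follow the standard Higgins formula, derived from Cochran's Q statistic and its degrees of freedom (number of studies minus one). A minimal sketch, with a hypothetical Q value:

```python
def i_squared(q: float, df: int) -> float:
    """Higgins' I^2: the percentage of variation in study estimates
    attributable to heterogeneity rather than sampling error.
    Negative values are truncated to zero by convention."""
    if q <= df:
        return 0.0
    return 100.0 * (q - df) / q

# Hypothetical example: Q = 20.0 across 11 studies (df = 10)
print(i_squared(20.0, 10))  # -> 50.0
```

When Q does not exceed its degrees of freedom, all observed variation is consistent with sampling error alone, which is why I² = 0% can arise even across visibly different studies.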
The results are striking. The training subgroup reported an overall strong and significant ability to discriminate high resolution audio. Furthermore, tests for heterogeneity gave I² = 0% and p = 0.59, suggesting strong consistency between those studies with training, and that all variation in study estimates could be attributed to sampling error. In contrast, those studies without training had an overall small effect. Heterogeneity tests reveal larger differences between these studies (I² = 23%), though this may still be attributed to statistical variation (p = 0.23). Contrasting the subgroups, the test for subgroup differences gives I² = 95.5% and p < 10⁻⁵, suggesting that almost all variation in subgroup estimates is due to genuine variation across the 'Training' and 'No training' subgroups rather than sampling error.
3.3 How does duration of stimuli and intervals affect results?
Unfortunately, statistical analysis of the effect of duration of stimuli and intervals is difficult. Of the 18 studies suitable for meta-analysis, only 12 provide information about stimulus duration and 6 about interval duration, and many other factors may have affected the outcomes. In addition, many experiments allowed test subjects to listen for as long as they wished, making these estimates very rough approximations. Nevertheless, strong results were reported in Theiss 1997, Kanetada 2013A, Kanetada 2013B and Mizumachi 2015, which all had long intervals between stimuli. In contrast, Muraoka 1981 and Pras 2010 had far weaker results with short duration stimuli. Furthermore, Hamasaki 2004 reported statistically significant stronger results when longer stimuli were used, even though participant and stimuli selection had more stringent criteria for the trials with shorter stimuli. This strongly suggests that duration of stimuli and intervals may be an important factor. A subgroup analysis was performed, dividing the studies between those with stated long duration stimuli and/or long intervals (30 seconds or more) and those stating only short duration stimuli and/or short intervals. The Hamasaki 2004 experiment was divided into the two subgroups based on stimulus durations of either 85-120 s or approximately 20 s [62, 64]. The subgroup with long duration stimuli reported 57% correct discrimination, whereas the short duration subgroup reported 52%. Though the distinction between these two groups was far weaker than when considering training, the subgroup differences were still significant at a 95% level (p = 0.04). This subgroup test also involved a small number of studies (14), and many studies in the long duration subgroup also involved training, so one can only say that it is suggestive that long durations for stimuli and intervals may be preferred for discrimination.
3.4 Effect of test methodology
We performed subgroup tests to evaluate whether there are significant differences between those studies where subjects performed a one interval forced choice (1IFC) 'same/different' test and those where subjects had to choose between two alternatives (ABX, AXY, or XY 'preference' or 'quality'). For same/different tests, the heterogeneity test gave I² = 67% and p = 0.003, whereas I² = 43% and p = 0.08 for ABX and variants, suggesting that both subgroups contain diverse sets of studies (note that this test has low power, so more importance is given to the I² value than the p value, and typically α is set to 0.1 [77]). A slightly higher overall effect was found for ABX, 0.05 compared to 0.02, but with confidence intervals overlapping those of the 1IFC 'same/different' subgroup. If methodology has an effect, it is likely overshadowed by other differences between studies.
3.5 Effect of quantization
Only a small number of studies considered perception of high resolution quantization (beyond 16 bits per sample). Theiss 1997 reported 94.1% discrimination for one test subject comparing 96 kHz/24-bit to 48 kHz/16-bit, and the significantly lower 64.9% discrimination over two subjects comparing 96 kHz/16-bit to 48 kHz/16-bit. Jackson 2014 compared 192 kHz to 44.1 kHz and to 48 kHz with different quantizers. They found no effect of 24 to 16 bit reduction in addition to the change in sample rate. Kanetada 2013A, Kanetada 2013B and Mizumachi 2015 all found strong results when comparing 16 bit to 24 bit quantization. Notably, Kanetada 2013B used a 48 kHz sample rate for all stimuli and thus focused only on the difference in quantization. However, Kanetada 2013A, Kanetada 2013B and Mizumachi 2015 all used undithered quantization. Dithered quantization is almost universally preferred since, although it increases the noise floor, it reduces noise modulation and distortion. But few have looked at perception of dither. [80] dealt solely with perception of the less commonly used subtractive dither, and only at low bit depths, up to 6 bits per sample. [81] investigated preference for dither for 4 to 12 bit quantizers in two bit increments. Interestingly, they found that at 10 or 12 bits, for all stimuli, test subjects either did not show a significant preference or preferred undithered quantization over rectangular dither and triangular dither for both subtractive and nonsubtractive dither. Jackson 2014 found very little difference (over all subjects and stimuli) in discrimination ability when dither was or was not applied. Thus, based on the evidence available, it is reasonable to include these as valid discrimination experiments even though dither was not applied.
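The distinction between dithered and undithered requantization discussed above can be sketched as follows. This is a minimal illustration, not any study's actual processing chain: it adds triangular-PDF (TPDF) dither of ±1 LSB at the target bit depth before rounding, the common nonsubtractive approach:

```python
import random

def requantize(sample: float, bits: int, dither: bool = True) -> float:
    """Requantize a sample in [-1.0, 1.0) to the given bit depth.

    With dither=True, triangular-PDF noise spanning +/-1 LSB (the sum of
    two independent uniform sources) is added before rounding, which
    decorrelates the quantization error from the signal at the cost of
    a slightly higher noise floor. With dither=False, the rounding error
    is deterministic and signal-correlated (distortion).
    """
    step = 2.0 / (2 ** bits)  # size of one least significant bit
    noise = (random.random() - random.random()) * step if dither else 0.0
    return round((sample + noise) / step) * step

# Hypothetical example: a low-level sample reduced to 16 bits.
x = 0.000123
print(requantize(x, 16))                 # dithered: varies trial to trial
print(requantize(x, 16, dither=False))   # undithered: always the same value
```

Averaged over many samples, the dithered path reproduces low-level detail below one LSB as noise rather than truncating it, which is the basis for the preference claims cited above.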
3.6 Is there publication bias?
3.7 Sensitivity analysis
Though the studies are diverse in their approaches, we considered fixed effect models in addition to random effect models. These give diminished (but still significant) results, primarily because large studies without training are weighted heavily under such models. We also considered treating the studies as yielding dichotomous rather than continuous results. That is, rather than mean and standard error over all participants, we simply consider the number of correctly discriminated trials out of all trials. This approach usually requires an experimental and control group, but due to the nature of the task and the hypothesis, it is clear that the control is random guessing, i.e., 50% correct as the number of trials approaches infinity. This knowledge of the expected behavior of the control group allows use of standard meta-analysis approaches for dichotomous outcomes. Treating the data as dichotomous gave stronger results, even though it allowed inclusion of Meyer 2007, which was one of the studies that most strongly supported the null hypothesis. Use of the Mantel-Haenszel (as opposed to Inverse Variance) meta-analysis approach with the dichotomous data had no influence on results. A full description of the statistical methods used for continuous and dichotomous results, fixed effects and random effects, and the Inverse Variance and Mantel-Haenszel methods, is given in the Appendix. Many studies involved several conditions, and some authors participated in several studies. Treating each condition as a different study (a valid option since some conditions had quite different stimuli or experimental set-ups) or merging studies with shared authors was performed for dichotomous data only, since it was no longer possible to associate results with unique participants. Treating all conditions as separate studies yielded the strongest outcome.
This is partly because some studies had conditions giving opposite results, thus hiding strong results when the different conditions were aggregated. Finally, we considered focusing only on sample rate and bandwidth (removing those studies that involved changes in bit depth) or only those using modern digital formats (removing the pre-2000s studies that used either analogue or DAT systems). Though this excluded some of the studies with the strongest results, it did not change the overall effect. Though not shown in Table 5, all of the conditions tested gave an overall effect with p < 0.01, and all showed far stronger ability to discriminate high resolution audio when the studies involved training.
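The reason fixed effect models diminish the overall result, as noted above, is that inverse-variance weighting lets large low-variance studies dominate the pooled estimate. A minimal sketch with hypothetical study effects (proportion correct minus the 50% chance level):

```python
def pool_fixed_effect(effects: list[float], variances: list[float]) -> tuple[float, float]:
    """Inverse-variance fixed effect pooling: each study's effect is
    weighted by the reciprocal of its variance, so large (low-variance)
    studies dominate the pooled estimate and its variance."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_var = 1.0 / sum(weights)
    return pooled, pooled_var

# Hypothetical: one small study with a large effect, two large studies
# with small effects. The small study is heavily down-weighted.
effects = [0.07, 0.02, 0.01]
variances = [0.004, 0.001, 0.0005]
pooled, pooled_var = pool_fixed_effect(effects, variances)
print(round(pooled, 4))
```

A random effects model instead inflates each study's variance by an estimated between-study component, which evens out the weights and is why it was preferred for this diverse set of studies.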