Research Article

Test retest variability in stereoacuity measurements


ABSTRACT

Background: A clinician’s choice of stereotest is influenced by the robustness of the measurement, in terms of sensitivity, specificity and test–retest variability. There are limited data on the test–retest variability of newer stereotests and how they compare to the more commonly used stereotests. Therefore, the aim of the study was to determine the test–retest variability of four different measures of stereoacuity (TNO, Frisby, Lang Stereopad and Asteroid (Accurate STEReotest On a mobIle Device)) and to compare the stereoacuity measurements between the tests in an adult population. Methods: Stereoacuity was measured twice using TNO, Frisby, Lang Stereopad and Asteroid. Inclusion criteria included adult participants (18 years and older), no known ophthalmic condition and VA (Visual Acuity) equal to or better than 0.3 logMAR (Logarithm of the Minimum Angle of Resolution) with interocular difference of less than 0.2 logMAR. Bland–Altman analysis was used to assess agreement within and between stereotests. Differences in stereo thresholds were compared using Wilcoxon signed-rank tests. Results: Fifty-four adults (male: 23 and female: 31) with VA equal to or better than 0.3 logMAR in either eye and interocular difference less than 0.2 logMAR were assessed (mean age: 38 years, SD: 12.7, range: 18–72). The test–retest variability of all the clinical stereotests, with the exception of the Lang Stereopad (p = .03, Wilcoxon signed-rank test), was clinically insignificant as the mean bias was equal to or less than 0.06 log seconds of arc (equivalent to 1.15 seconds of arc). While the Asteroid test had the smallest variation between repeated measures (mean bias: −0.01 log seconds of arc), the Frisby and Lang Stereopad tests had the narrowest and widest limits of agreement respectively. When comparing results between tests, the biggest mean bias was between Frisby and Lang Stereopad (−0.62 log seconds of arc), and 64.8% and 31.5% of differences were in the medium (21–100” of arc) and large (>100” of arc) ranges respectively. Conclusion: The TNO and Frisby tests have good reliability but measure stereoacuity over a narrower range compared to the Asteroid, which shows less variation on repeated testing but has a larger testing range. The data reported here show varying degrees of agreement in a cohort of visually normal participants, and further investigation is required to determine if there is further variability when stereoacuity is reduced.

Introduction

Stereopsis is the highest grade of binocular vision, which is a measure of a person’s ability to detect depth through the visual cortex’s processing of disparate retinal images. Assessment of stereoacuity is an integral component of the orthoptic investigation, with a range of clinical tests available. These tests vary significantly in many aspects of the test design, including the range of disparities measured, the presence of monocular cues and the method of presenting the disparate images, utilizing polarizing, anaglyph, lenticular or physical/“real” depth stimuli.Citation1 A clinician’s choice of test can be influenced by the weaknesses/limitations present in some stereotests. These include the presence of monocular cues, the ability to guess a correct answer and a limited range of disparities measurable, both in terms of the overall range and fixed options within the upper and lower limits. In addition to variations between tests, there are also variations within tests; for example, the more recent version of the TNO test results in lower stereoacuity in comparison to the previous version.Citation2 To limit the impact of these factors, modifications have been proposed for some tests. For example, with the Wirt fly test, the largest disparity of 3000” can easily be passed with monocular viewing, but the introduction of additional glasses that provide a monocular view requires patients to consistently provide a positive response only when disparity is present.Citation3 In addition, there are new tests available that aim to address these limitations. The Lang Stereopad minimizes the potential guess rate with up to six stereo cards presented at once,Citation4 whereas the Asteroid (Accurate STEReotest On a mobIle Device) test uses a glasses-free 3D tablet to present the stimuli within a computer game designed to be more engaging, and uses a dynamic random dot pattern that eliminates monocular cues. It utilizes the camera to detect test distance, which results in automatic calibration of the disparity size, and an adaptive staircase is utilized to determine the stereoacuity threshold.Citation5

The purpose of assessing stereoacuity is to evaluate a patient’s current state of binocular vision to determine whether it is normal and whether it has changed in relation to the previous findings. Results obtained can influence management decisions, for example, reduced stereoacuity has been utilized as an indication for surgical intervention in intermittent distance exotropia.Citation6–8 Given the impact of the results, it is essential that the assessment and interpretation are accurate. Test accuracy can be defined by different measures such as testability (the proportion of people able to successfully complete the test), test–retest repeatability within subject and between testers and the sensitivity/specificity in detecting deficient stereopsis. To be able to interpret an individual test response, normative data are required across all age groups, and to determine whether there has been a change in stereoacuity, the test–retest variation (TRV) must be known, as these values vary between tests. Due to changes in the visual system and cognitive development, a normal value for stereopsis improves and variability reduces with increasing age during childhood, which requires normative data over the life span to ensure accuracy in the interpretation of results. Overall, the evidence shows an improvement with age, plateauing around the age of 10 years.Citation1 Normative data for adults encompass a range of tests including TNO, Preschool Randot Stereotest, Frisby, Distance Randot and the FD2.Citation9–11

As shown, there is evidence relating to the normative data in the adult population; however, it is limited in relation to the TRV. Identical scores on repeated testing have been found in 25–73% of children,Citation12–15 with the validity of an abnormal result being considered questionable for some tests due to the considerable variation,Citation16 but the contribution the variable cognitive abilities of children make on this finding is not known. For adults, only two reports were found which assessed the Randot circles, Asteroid, Frisby and Titmus,Citation17,Citation18 with none found for the Lang Stereopad or TNO.

As test choice is influenced by the patient’s cognitive ability or clinician preference, comparison between tests may also be of interest if different stereoacuity tests are used throughout a patient’s care. Reports have shown a moderate-to-good level of correlation between a range of tests,Citation17 as anticipated given that they are all measures of the same visual function, but variability does exist, in particular, between varying methods of stimulus presentation (e.g. real depth compared to Randot testCitation19 or contour-based circles and random dot presentation).Citation20 As the Asteroid test is relatively new, the only comparisons found did not include the commonly used Frisby test, thus warranting further investigation. As the Frisby and TNO tests are the most commonly used tests in the UK,Citation21 comparison with these tests is important to facilitate comparison with clinical standard tests. Therefore, the aims of this study were to determine the test–retest variability of four different measures of stereoacuity (TNO, Frisby, Lang Stereopad and Asteroid) and to compare the stereoacuity values between the tests in an adult population.

Materials and methods

Participants

Adult participants were recruited from the friends and family of the investigators and the University of Liverpool network. Inclusion criteria were participants aged 18 years or older and no known ophthalmic conditions. Participants were also required to have VA (Visual Acuity) equal to or better than 0.3 logMAR (Logarithm of the Minimum Angle of Resolution) in either eye and an interocular difference of less than 0.2 logMAR. All participants were required to provide consent prior to participation. This study was approved by the University of Liverpool ethics committee and followed the tenets of the Declaration of Helsinki.

Test procedures

Testing was performed under standard illumination (500 lux). Participants with distance or near refractive errors wore their habitual correction during testing. Uniocular visual acuity was measured (per letter) using the ETDRS (Early Treatment Diabetic Retinopathy Study) chart for near and distance. Twenty-four different permutations for the order of testing TNO, Frisby, Lang Stereopad and Asteroid (Version 1.0.42) were used to minimize any bias. The tests were repeated for each participant, resulting in two measurements per stereotest. For the Lang Stereopad and Frisby, the stereoacuity threshold was recorded as the smallest disparity correctly identified on three presentations. The Lang Stereopad was measured as a four-alternative forced choice test, and Frisby was measured to threshold at 10 cm increments. If a participant gave an incorrect response, the previous distance was repeated to ensure that threshold was reached at that level. Participants were required to identify both targets (plates V–VII) at each threshold for TNO.
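The counterbalancing described above can be sketched as follows. This is a minimal illustration only: the text states that 24 orderings were used but not how they were allocated, so the cycling rule in `order_for` and the function name itself are assumptions.

```python
from itertools import permutations

TESTS = ("TNO", "Frisby", "Lang Stereopad", "Asteroid")

# Every possible ordering of the four stereotests: 4! = 24 sequences.
ORDERS = list(permutations(TESTS))


def order_for(participant_index):
    """Hypothetical allocation rule: cycle through the 24 orderings in turn."""
    return ORDERS[participant_index % len(ORDERS)]


print(len(ORDERS))   # 24
print(order_for(0))  # ('TNO', 'Frisby', 'Lang Stereopad', 'Asteroid')
```

With 54 participants, cycling in this way would use each ordering two or three times.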

Participants were asked to view each of the stereotests while placing their head on a chin rest to avoid parallax and use of monocular cues. Test distances were marked on a table, where the tests were placed on a box at eye level to ensure accuracy. For the Asteroid test, a distance tracker sticker was placed on the participant’s forehead, and the participant was instructed to hold the pad.

Statistical methods

Each test had a different range of measurable stereoacuity in seconds of arc (TNO: 480–15”, Frisby: 600–20”, Lang Stereopad: 800–50” and Asteroid: 1000–1.25”). Because these ranges differ, an arbitrary high value of 3000” was used to indicate that there was no measurable stereopsis. All stereo data were log-transformed to allow statistical analysis as the log thresholds are closer to a normal distribution. Normality tests were conducted using the Kolmogorov–Smirnov test, and despite log transformation, the stereo thresholds were not normally distributed. Therefore, non-parametric tests were used during the analysis. Differences in thresholds between the stereotests were examined with the Wilcoxon signed-rank test. A Bonferroni correction for multiple comparisons was used; therefore, significance was adjusted to p < .008. Bland–Altman analysis was employed for assessing agreement within and between the stereotests. The upper and lower limits of agreement were defined as the mean bias ±1.96 standard deviations of the differences.
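As a rough sketch of the Bland–Altman calculation described above, the following computes the mean bias and limits of agreement on log-transformed thresholds. The function names, the use of `None` to mark no measurable stereopsis, the sign convention (first minus second) and the sample data are all assumptions made for illustration; they are not taken from the study.

```python
import math
from statistics import mean, stdev

NO_STEREO_ARCSEC = 3000.0  # arbitrary high value used for "no measurable stereopsis"


def to_log(threshold_arcsec):
    """log10 of a threshold in seconds of arc; None means no measurable stereopsis."""
    value = NO_STEREO_ARCSEC if threshold_arcsec is None else threshold_arcsec
    return math.log10(value)


def bland_altman(first, second):
    """Return (mean bias, lower limit, upper limit) for paired thresholds."""
    diffs = [to_log(a) - to_log(b) for a, b in zip(first, second)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)  # sample SD of the paired differences
    return bias, bias - half_width, bias + half_width


# Made-up thresholds in seconds of arc, for illustration only.
test1 = [60, 120, 30, None, 240, 60]
test2 = [60, 60, 30, 480, 240, 120]
bias, lower, upper = bland_altman(test1, test2)
print(f"bias={bias:.3f}, limits=({lower:.3f}, {upper:.3f})")
```

Working in log units means the bias and limits describe ratios between thresholds rather than absolute differences in seconds of arc.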

Results

Demographics

Fifty-four participants were included in the study (male: 23 and female: 31), mean age 38 years (SD: 12.7, range: 18–72). The mean distance VA for both right and left eye was −0.08 logMAR (SD: 0.08 for right and 0.11 for left), and mean near VA for right and left eye was 0.02 logMAR (SD: 0.13) and −0.01 logMAR (SD: 0.12), respectively.

Test–retest variability

Table 1 illustrates the stereo threshold data for each stereotest measured during tests 1 and 2. Stereo data are presented as median seconds of arc and the equivalent log as the data were not normally distributed. The closest agreement (smallest mean bias) between tests 1 and 2 was for the Asteroid test, followed by the Frisby and TNO, with the largest variability for the Lang Stereopad test. No significant differences were found between stereo thresholds measured during tests 1 and 2 for any test except the Lang Stereopad (p = .03, Wilcoxon signed-rank test). The difference between tests 1 and 2 for TNO did not reach statistical significance (p = .05, Wilcoxon signed-rank test), although the interquartile range (IQR) was slightly larger on the second measurement, indicating that there is some variability. The IQR is lower on the second attempt (test 1 = 150” of arc and test 2 = 350” of arc) for the Lang Stereopad, and more participants (N = 19) had a better stereo threshold during test 2, suggesting a possible practice effect. There is very little change in the IQR for Asteroid (p = .99, Wilcoxon signed-rank test) and Frisby (p = .14, Wilcoxon signed-rank test) between tests 1 (IQR = 72.75” of arc) and 2 (IQR = 108.50” of arc), indicating that there is less variability (Table 1).

Table 1. Median, interquartile range presented in seconds of arc and log values. Paired comparison (using Wilcoxon signed-rank test) of stereo thresholds within each stereotest (tests 1 and 2) (N = 54) (*p < .05).

Figure 1 compares the results from tests 1 and 2 for each of the stereotests. A number to the left of a point indicates how many data points overlap at that position; a point without a number represents a single data point. There are no overlapping points on the Asteroid test–retest figure. Frisby has the least amount of variability between tests 1 and 2, and while there are no overlapping points for the Asteroid in Figure 1, the points are closely clustered, and the limits of agreement are narrower than those of Lang Stereopad ().

Figure 1. Scatterplot of each stereotest, test 1 vs test 2 for all stereotests. Numbers to the left of the point indicate number of overlapping points and those without a number indicate a single value.


Between test comparisons

Paired comparisons across stereotests were conducted using the value obtained on test 2 to minimize any practice effects. Most pairwise comparisons were significant (p < .008, Wilcoxon signed-rank test with Bonferroni correction), with the exceptions of TNO vs Asteroid and Lang Stereopad vs Asteroid; these exceptions may be due to the spread of values for the Asteroid test (). The median thresholds measured for TNO and Asteroid were very similar, with a mean difference of −0.07 logarcsec (Table 2). However, while stereoacuity measured with Lang Stereopad was similar to the Asteroid (mean bias: 0.16), the measurements with Lang Stereopad (2.00 log = 100” of arc) were slightly worse than the Asteroid (1.91 log = 81.50” of arc), a difference that is significant at the conventional level of p < .05 though not after the Bonferroni correction (Table 2).

Table 2. Paired analysis between stereotests (Wilcoxon signed-rank test) (N = 54). *Significant at <0.008 (Bonferroni correction for multiple comparisons). A positive mean bias means the first test had a higher stereoacuity score.
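The adjusted threshold of .008 is consistent with dividing the conventional .05 level by the six possible pairings of four tests. The sketch below works through that arithmetic; the assumption that six comparisons underlie the correction, and the example p value of .03, are illustrative and not stated in the text.

```python
from itertools import combinations

TESTS = ("TNO", "Frisby", "Lang Stereopad", "Asteroid")
ALPHA = 0.05

# All unordered pairs of the four stereotests: C(4, 2) = 6 comparisons.
pairs = list(combinations(TESTS, 2))
adjusted_alpha = ALPHA / len(pairs)  # 0.05 / 6, roughly 0.0083


def is_significant(p_value, threshold=adjusted_alpha):
    """True when the p value falls below the chosen threshold."""
    return p_value < threshold


print(len(pairs))                   # 6
print(round(adjusted_alpha, 4))     # 0.0083
print(is_significant(0.03, ALPHA))  # True at the conventional level
print(is_significant(0.03))         # False after Bonferroni correction
```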

As there are overlapping data points on the scatterplots (Figure 2), Figure 3 summarizes the test differences into small (<21”), medium (21–100”) and large (>100”). When comparing the TNO and Frisby, two-thirds of the differences were either small or medium (Figure 3), whereas large differences (41%) were seen between Lang Stereopad and Asteroid (Figure 3) despite the non-significant comparison between them (Table 2). Similarly, despite there being no significant difference between TNO and Asteroid on pairwise testing, 59% and 37% of differences between the two tests are medium and large, respectively (Figure 3).
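The grading of between-test differences can be sketched as below. This assumes the difference is taken as the absolute difference in seconds of arc between the two tests for each participant, with the boundaries at 21” and 100” treated as belonging to the medium grade; the function names and sample data are hypothetical.

```python
from collections import Counter


def grade_difference(first_arcsec, second_arcsec):
    """Grade the absolute difference between two thresholds (seconds of arc)."""
    diff = abs(first_arcsec - second_arcsec)
    if diff < 21:
        return "small"
    if diff <= 100:
        return "medium"
    return "large"


def grade_percentages(paired_thresholds):
    """Percentage of pairs in each grade, rounded to one decimal place."""
    counts = Counter(grade_difference(a, b) for a, b in paired_thresholds)
    total = len(paired_thresholds)
    return {g: round(100 * counts[g] / total, 1) for g in ("small", "medium", "large")}


# Made-up pairs of thresholds, for illustration only.
example = [(60, 60), (120, 60), (30, 240), (15, 30)]
print(grade_percentages(example))
# {'small': 50.0, 'medium': 25.0, 'large': 25.0}
```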

Figure 2. Scatterplot of comparison of each stereotest. Numbers to the left of the point indicate number of overlapping points and those without a number indicate a single value.


Figure 3. Percentages of differences between tests graded as small (<21” of arc), medium (21–100” of arc) and large (>100” of arc).


Discussion

In this cohort of visually normal adult participants, all stereotests, except the Lang Stereopad, had minimal (≤0.06 log seconds of arc, equivalent to 1.15 seconds of arc) test–retest variability. However, when comparing results between tests, stereoacuity measurements differed significantly between most pairs of tests.

On test–retest analysis, the Asteroid test had the smallest mean bias followed by the Frisby and TNO, indicating that it is the least variable on repeated testing. The wider limits of agreement for Lang and Asteroid may be due to the wider range over which stereoacuity can be measured with these tests, with the range of the Lang being 750” and Asteroid 999”, compared to the 465” of TNO and 580” of Frisby. The repeatability for Frisby (mean bias: 1” of arc) is in agreement with a study of young adults with normal binocular vision that reported good repeatability for Frisby (mean bias: 2” of arc), in which 89% of participants achieved the lowest disparity (20” of arc).Citation18 The limits of agreement for Asteroid are slightly wider, but the mean bias in our study was smaller than that reported by McCaslin et al.,Citation17 whose study had 39 participants, comprising children and adults younger than 50 years of age (mean bias: 0.058 log arcsec, 95% limits of agreement: ±0.370). While there was less than 0.1 log seconds (1.25” of arc) mean bias for TNO, Frisby and Asteroid, there was a statistically significant difference in test–retest for the Lang Stereopad (mean bias = 0.14 log seconds equivalent to 1.39” of arc, p = .03), which is contrary to Rowe et al.’sCitation4 analysis on a subset of their participants (N = 36, p = .425). However, the mean bias (1.39”) is not clinically significant and may be explained by the practice effect as the interquartile range decreased on the second attempt.
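The range widths and the log-to-linear conversion quoted above can be checked with a short sketch. The ranges are those listed in the statistical methods; the function names are hypothetical. Note that a difference of d log10 units corresponds to a multiplicative factor of 10^d between two thresholds, which is the quantity the text quotes as the equivalent in seconds of arc.

```python
# Measurable ranges (upper, lower) in seconds of arc, as listed in the methods.
RANGES = {
    "TNO": (480, 15),
    "Frisby": (600, 20),
    "Lang Stereopad": (800, 50),
    "Asteroid": (1000, 1.25),
}


def range_width(test):
    """Width of the measurable range in seconds of arc."""
    upper, lower = RANGES[test]
    return upper - lower


def log_bias_to_ratio(log_bias):
    """A difference of d log10 units corresponds to a factor of 10**d."""
    return 10 ** log_bias


for name in RANGES:
    print(name, range_width(name))  # 465, 580, 750 and 998.75 respectively

print(round(log_bias_to_ratio(0.06), 2))  # 1.15
```

The Asteroid width of 998.75” rounds to the 999” quoted in the text.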

When comparing measurements between tests, TNO and Frisby appear the closest in terms of the biggest proportion of results with small or medium difference (Figure 3), despite the different methods of presentation. Our data () support the typical finding of TNO measuring higher (worse) thresholds of stereoacuity in observers with binocular vision,Citation22 and a possible explanation is the use of anaglyph 3D glasses as this has been shown to produce artifacts when testing binocular vision due to the potential interocular contrast differencesCitation23 and reduction in binocular motor fusion.Citation22 The red-green glasses are also dissociative due to the color mismatch, which has been shown to reduce stereopsis.Citation24 Red-green glasses have also been reported to allow significant cross talk, whereby part of the image that is presented to one eye passes through the filter of the other eye and reduces stereoacuity.Citation25 Frisby, on the other hand, is described as measuring “real depth”,Citation19 as it does not involve the use of polarizing filters or anaglyph glasses to appreciate depth. Serrano-Pedraza et al.Citation25 described Frisby as using physical depth, whereby motion parallax can be used to detect the target without using stereopsis. However, in our study, a chin rest was used to minimize the impact of motion parallax on the stereo threshold. Hence, the differences are likely due to the inherent modes of presentation of the stereotests (for example, anaglyph versus “real depth” measurements) rather than an artifact of the methodology used in this study.

While there was no significant difference between TNO and Asteroid with a very small mean bias, a large proportion of individuals (59%) had medium differences (21–100” of arc) in stereo measurements between TNO and Asteroid (Figure 3). This may be explained by the difference in the range measured by each of these tests, where TNO and Frisby have narrower testing ranges compared to the Lang Stereopad and Asteroid (TNO: 480–15” and Asteroid: 1000–1.25”), as well as the different modes of presentation, for example, dot size or static vs dynamic random dot. Asteroid uses a dynamic random dot stereogram to eliminate monocular cues and will produce erroneous results if it is held too close or tilted to detect the target using motion parallax.Citation5 The higher thresholds obtained with the Asteroid have been explained by the differences in dot size compared to other tests based on random dot stereograms and by its stimulus being dynamic as opposed to static in the other tests.Citation5

Study limitations

Testing visually normal participants allows us to evaluate the test efficacy, but it is known that reduced stereoacuity can affect the variability.Citation1 Hence, further evaluation is required in a wider clinical cohort of varying abilities to determine how these measures would be affected by different ophthalmic conditions, for example, in patients with amblyopia and/or strabismus. There is evidence to suggest that stereoacuity declines with age,Citation10,Citation11,Citation26,Citation27 but as comparisons are within participants, this does not affect the conclusions of this study. A further potential source of variability is that stereoacuity was measured by two students studying for a Nuffield Science Project, and while they were trained and had a strict protocol to adhere to, their measurements could differ from those of an experienced orthoptist.

Summary

The standard clinical tests, TNO and Frisby, have good reliability but cannot be used interchangeably. Therefore, stereotest selection for patients should remain constant between visits. The Asteroid had good reliability and compared well with the TNO in adults with good visual acuity. However, further evaluation of these tests is required to determine reliability in larger cohorts, in patients with impaired stereoacuity and in older adults with normal visual acuity and binocular vision.

Acknowledgements

Thanks to the Nuffield students, Ailin Anto and Danah Al-Khateeb, for the stereo data collection.

Disclosure statement

Dr Anna O’Connor was an unpaid adviser on the Asteroid Stereo Project led by Jenny Read. The Asteroid stereotest was used in this project under a research license and provided by the Asteroid Stereo Project Team.

Additional information

Funding

The author(s) reported that there is no funding associated with the work featured in this article.

References

  • O’Connor AR, Tidbury LP. Stereopsis: are we assessing it in enough depth? Clin Exp Optometry. 2018;101(4):485–494. doi:10.1111/cxo.12655.
  • van Doorn LL, Evans BJ, Edgar DF, Fortuin MF. Manufacturer changes lead to clinically important differences between two editions of the TNO stereotest. Ophthalmic Physiol Opt. 2014;34(2):243–249. doi:10.1111/opo.12101.
  • De La Cruz A, Morale SE, Jost RM, Kelly KR, Birch EE. Modified test protocol improves sensitivity of the stereo fly test. Am Orthopt J. 2016;66(1):122–125. doi:10.3368/aoj.66.1.122.
  • Rowe FJ, Hepworth LR, Howard C, Chean CS, Mistry M. Comparative analysis of the Lang Stereopad in a non-clinic population. Strabismus. 2019;27(3):182–190. doi:10.1080/09273972.2019.1643893.
  • Vancleef K, Serrano-Pedraza I, Sharp C, et al. ASTEROID: a new clinical stereotest on an autostereo 3D tablet. Transl Vis Sci Technol. 2019;8(1):25. doi:10.1167/tvst.8.1.25.
  • Holmes JM, Birch EE, Leske DA, Fu VL, Mohney BG. New tests of distance stereoacuity and their role in evaluating intermittent exotropia. Ophthalmology. 2007;114(6):1215–1220. doi:10.1016/j.ophtha.2006.06.066.
  • Stathacopoulos RA, Rosenbaum AL, Zanoni D, et al. Distance stereoacuity: assessing control in intermittent exotropia. Ophthalmology. 1993;100:495–500.
  • Sharma P. The pursuit of stereopsis. J AAPOS. 2018;22: 2:e1–2 e5.
  • Piano ME, Tidbury LP, O’Connor AR. Normative values for near and distance clinical tests of stereoacuity. Strabismus. 2016;24:169–172.
  • Garnham L, Sloper JJ. Effect of age on adult stereoacuity as measured by different types of stereotest. Br J Ophthalmol. 2006;90(1):91–95. doi:10.1136/bjo.2005.077719.
  • Bohr I, Read JC. Stereoacuity with Frisby and revised FD2 stereo tests. PLoS One. 2013;8(12):e82999. doi:10.1371/journal.pone.0082999.
  • Schmidt PP. Vision screening with the RDE stereotest in pediatric populations. Optom Vis Sci. 1994;71(4):273–281. doi:10.1097/00006324-199404000-00008.
  • Schmidt P, Maguire M, Kulp MT, Dobson V, Quinn G, Vision in Preschoolers Study G. Random Dot E stereotest: testability and reliability in 3- to 5-year-old children. J AAPOS. 2006;10:507–514.
  • Fawcett SL, Birch EE. Interobserver test-retest reliability of the Randot preschool stereoacuity test. J Am Assoc Pediatr Ophthalmol Strabismus. 2000;4(6):354–358. doi:10.1067/mpa.2000.110340.
  • Simons K. Stereoacuity norms in young children. Arch Ophthalmol. 1981;99(3):439–445. doi:10.1001/archopht.1981.03930010441010.
  • Adler P, Scally AJ, Barrett BT. Test–retest variability of Randot stereoacuity measures gathered in an unselected sample of UK primary school children. Br J Ophthalmol. 2012;96(5):656–661. doi:10.1136/bjophthalmol-2011-300729.
  • McCaslin AG, Vancleef K, Hubert L, Read JCA, Port N. Stereotest comparison: efficacy, reliability, and variability of a new glasses-free stereotest. Transl Vis Sci Technol. 2020;9(9):29. doi:10.1167/tvst.9.9.29.
  • Antona B, Barrio A, Sanchez I, Gonzalez E, Gonzalez G. Intraexaminer repeatability and agreement in stereoacuity measurements made in young adults. Int J Ophthalmol. 2015;8:374–381.
  • Leske DA, Birch EE, Holmes JM. Real depth vs Randot stereotests. Am J Ophthalmol. 2006;142(4):699–701. doi:10.1016/j.ajo.2006.04.065.
  • Fawcett SL. An evaluation of the agreement between contour-based circles and random dot-based near stereoacuity tests. J Am Assoc Pediatr Ophthalmol Strabismus. 2005;9(6):572–578. doi:10.1016/j.jaapos.2005.06.006.
  • Vancleef K, Read JCA. Which stereotest do you use? A survey research study in the British Isles, the United States and Canada. Br Ir Orthopt J. 2019;15(1):15–24. doi:10.22599/bioj.120.
  • Vancleef K, Read JCA, Herbert W, Goodship N, Woodhouse M, Serrano-Pedraza I. Overestimation of stereo thresholds by the TNO stereotest is not due to global stereopsis. Ophthalmic Physiol Opt. 2017;37(4):507–520. doi:10.1111/opo.12371.
  • Simons K, Elhatton K. Artifacts in fusion and stereopsis testing based on red/green dichoptic image separation. J Pediatr Ophthalmol Strabismus. 1994;31(5):290–297. doi:10.3928/0191-3913-19940901-05.
  • Cornforth LL, Johnson BL, Kohl P, Roth N. Chromatic imbalance due to commonly used red-green filters reduces accuracy of stereoscopic depth perception. Am J Optom Physiol Opt. 1987;64(11):842–845. doi:10.1097/00006324-198711000-00007.
  • Serrano-Pedraza I, Vancleef K, Read JC. Avoiding monocular artifacts in clinical stereotests presented on column-interleaved digital stereoscopic displays. J Vis. 2016;16:13.
  • Lee SY, Koo NK. Change of stereoacuity with aging in normal eyes. Korean J Ophthalmol. 2005;19(2):136–139. doi:10.3341/kjo.2005.19.2.136.
  • Zaroff CM, Knutelska M, Frumkes TE. Variation in stereoacuity: normative description, fixation disparity, and the roles of aging and gender. Invest Ophthalmol Vis Sci. 2003;44(2):891–900. doi:10.1167/iovs.02-0361.