Analysis of Replication Studies

Linking the Causal Replication Framework to Analysis: Assessing Replication Success

Currently, there is no consensus among researchers about how replication success should be determined or assessed. As a result, replicators have relied on a constellation of ad hoc approaches for judging replication success. This has produced tremendous variation in how researchers and policy-makers interpret replication results, leaving the validity of their conclusions vulnerable to researcher bias.

The most popular measure for assessing correspondence in results is based on equivalence in significance patterns – that is, if both studies reveal a statistically significant effect, then replication success is achieved (Maxwell, Lau, & Howard, 2015). This correspondence measure addresses the policy question of whether two studies arrive at the same conclusion. Other policy-related correspondence measures include examining the size and direction of impact estimates across the two studies. To examine the methodological question of whether effect estimates from two studies are “close” enough to conclude that they replicate, researchers frequently test the significance of the difference in effect estimates using a two-sample t-test (Valentine et al., 2011). Very rarely have researchers used the more appropriate equivalence test (Anderson & Maxwell, 2016; Cumming & Maillardet, 2006; Steiner & Wong, 2018). In cases where results from multiple replication efforts are available, meta-analytic approaches may be used to estimate the average size and variation of effect estimates (Valentine et al., 2011; Hedges & Schauer, 2018).
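To make these measures concrete, the minimal Python sketch below computes the significance-pattern, direction, and difference-test measures for two independent effect estimates. The function name, the use of a large-sample z-test in place of the two-sample t-test, and the numeric inputs are illustrative assumptions, not part of any cited procedure.

```python
import numpy as np
from scipy import stats

def correspondence_measures(est1, se1, est2, se2, alpha=0.05):
    """Ad hoc correspondence measures for two independent effect estimates:
    significance pattern, direction, and a test of the difference in estimates."""
    z_crit = stats.norm.ppf(1 - alpha / 2)

    # Policy question: do the two studies reach the same conclusion?
    sig1 = abs(est1 / se1) > z_crit
    sig2 = abs(est2 / se2) > z_crit

    # Methodological question: is the difference in estimates significant?
    # (large-sample z-test on the difference of two independent estimates)
    diff = est1 - est2
    se_diff = np.sqrt(se1 ** 2 + se2 ** 2)
    p_diff = 2 * (1 - stats.norm.cdf(abs(diff) / se_diff))

    return {
        "both_significant": sig1 and sig2,
        "same_significance_pattern": sig1 == sig2,
        "same_direction": np.sign(est1) == np.sign(est2),
        "difference": diff,
        "p_value_difference": p_diff,
    }

# Purely illustrative numbers: original d = 0.25 (SE 0.08), replication d = 0.15 (SE 0.10)
print(correspondence_measures(0.25, 0.08, 0.15, 0.10))
```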

The Challenge with Difference Tests for Assessing Replication Success

Although examining whether two studies produce the same significance pattern is intuitively appealing as a measure of replication success, this approach has multiple limitations. First, whether the two studies succeed or fail in demonstrating a significant effect depends strongly on the unknown “true” causal effect. With small or medium-sized true effects, contradictory test results from the two studies (one significant, the other insignificant) are much more likely than with large true effects. Second, even when both studies have a power of 0.8 with respect to the true effect, the probability of obtaining a significant result from both studies is not very high: it is 0.8 × 0.8 = 0.64. Third, when both studies are underpowered, they are unlikely to both reject the null hypothesis of no effect, and there is a substantial probability that both yield insignificant results, which would be read as “successful replication” of a potentially false null effect. For example, if both studies have a power of 0.4 to detect a true effect, the probability that they do not both reject the false null hypothesis is 1 − 0.4 × 0.4 = 0.84, and the probability that both fail to reject is 0.6 × 0.6 = 0.36. Fourth, in post-hoc replication designs where replication of the original study’s significance is the goal, even obtaining an appropriate estimate of the true effect is challenging. Published effect size estimates from original studies are typically biased upward due to publication bias, contain sampling variability that is regularly not accounted for, and are uninformative about potential effect heterogeneities, which can be quite considerable across replications (Howard et al., 2009; McShane & Böckenholt, 2014). Though some solutions have been proposed for mitigating the challenges that arise in replicating significance patterns in post-hoc designs (Anderson & Maxwell, 2016; McShane & Böckenholt, 2014), we recommend examining the difference in effect estimates as a measure of replication success because this metric does not depend on an unknown true effect.
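This arithmetic is easy to verify; the short sketch below reproduces the probabilities above for two independent studies with equal power (the 0.8 and 0.4 values come directly from the example).

```python
# Joint significance patterns for two independent studies with equal power,
# reproducing the probabilities cited above (0.64 and 0.84).
for power in (0.8, 0.4):
    p_both_significant = power * power
    p_not_both_significant = 1 - p_both_significant
    p_both_nonsignificant = (1 - power) ** 2
    print(f"power = {power}: both significant = {p_both_significant:.2f}, "
          f"not both significant = {p_not_both_significant:.2f}, "
          f"both non-significant = {p_both_nonsignificant:.2f}")
# power = 0.8: both significant = 0.64, not both significant = 0.36, both non-significant = 0.04
# power = 0.4: both significant = 0.16, not both significant = 0.84, both non-significant = 0.36
```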

Although two-sample t-tests for the difference between two effect estimates (including checks for overlapping confidence intervals of the two study results) are also often used in the replication literature, careful consideration of the typical research question in replication designs reveals serious weaknesses with this approach as well. Standard null hypothesis significance testing (NHST) traditionally protects against Type I errors, with less concern about Type II errors (where the researcher concludes that there is no evidence for a difference when a difference in the true effects actually exists). As a result, under the NHST framework, results may be inconclusive when tests of difference are used to assess whether two effect estimates are the same (Tryon, 2001; Tryon & Lewis, 2008). When two estimates are extremely different and the overall replication design is adequately powered, there may be evidence to reject the null of equal effects. But when effect estimates are similar, or when power is low in the overall replication design, the researcher may be uncertain about how to interpret the lack of evidence to reject the null hypothesis. It could mean that there is no difference in treatment effect estimates, or it could reflect a Type II error, particularly when the two studies together are underpowered for detecting effect differences.
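A small simulation illustrates the problem. The true effects (0.20 and 0.50 SD) and standard errors (0.15 SD) below are assumed values for illustration, not taken from any cited study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)

# Assumed scenario (not from the source): the true effects genuinely differ
# (0.20 vs 0.50 SD), but each study estimates its effect with SE = 0.15 SD.
true_effects = np.array([0.20, 0.50])
se = np.array([0.15, 0.15])
n_sim, alpha = 100_000, 0.05

estimates = rng.normal(true_effects, se, size=(n_sim, 2))
z = (estimates[:, 0] - estimates[:, 1]) / np.sqrt((se ** 2).sum())
reject_no_difference = np.abs(z) > stats.norm.ppf(1 - alpha / 2)

print(f"Power of the difference test: {reject_no_difference.mean():.2f}")
# Roughly 0.29: in about 70% of replications the test fails to reject "no
# difference" even though the true effects differ by 0.30 SD, so a
# non-significant difference cannot be read as evidence of equivalence.
```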

Correspondence Tests for Assessing Replication Success

Building on Tryon and Lewis’s (2008) work on equivalence tests, Steiner and Wong (2018) recommend using a correspondence test to assess the comparability of results in design-replication studies. The correspondence test combines statistical tests of difference and equivalence into a single framework. Here, the test of difference is the standard t-test in the NHST framework, and the test of equivalence assesses whether an observed difference in estimates is significantly smaller than a pre-specified tolerance threshold. The equivalence test thus helps rule out the competing explanation that an apparently small difference in estimates merely reflects sampling error around a meaningfully different pair of true effects. The tolerance threshold required for the equivalence test is the smallest difference in effect estimates that would be considered substantively and methodologically inconsequential or trivial. For example, a difference of 0.05 SD between the original and replication study may be considered tolerable. The correspondence test combines the difference and equivalence tests into one procedure that yields one of four conclusions regarding the causal estimands of studies 1 and 2: statistical equivalence, statistical difference, trivial difference (i.e., there is evidence of a statistical difference, but the magnitude of the difference is smaller than the tolerance threshold), and statistical indeterminacy. Indeterminacy is indicated when the two studies are insufficiently powered such that neither a significant difference nor significant equivalence can be established (e.g., due to small sample sizes or large error variances in at least one of the two studies).
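The four-way decision can be sketched as follows, using a large-sample z-test for the difference and a two one-sided tests (TOST) formulation of the equivalence test. This is an illustrative simplification under assumed normality, not the authors’ exact implementation, and the numeric values are hypothetical.

```python
import numpy as np
from scipy import stats

def correspondence_test(est1, se1, est2, se2, delta, alpha=0.05):
    """Combine a difference test and an equivalence test for two independent
    effect estimates; `delta` is the pre-specified tolerance threshold."""
    diff = est1 - est2
    se_diff = np.sqrt(se1 ** 2 + se2 ** 2)

    # Difference test: H0 is "no difference" (standard two-sided NHST test).
    p_difference = 2 * (1 - stats.norm.cdf(abs(diff) / se_diff))
    significant_difference = p_difference < alpha

    # Equivalence test (TOST): H0 is "|true difference| >= delta"; conclude
    # equivalence only if both one-sided tests reject at level alpha.
    p_lower = 1 - stats.norm.cdf((diff + delta) / se_diff)  # H0: difference <= -delta
    p_upper = stats.norm.cdf((diff - delta) / se_diff)      # H0: difference >= +delta
    significant_equivalence = max(p_lower, p_upper) < alpha

    if significant_difference and significant_equivalence:
        return "trivial difference"       # real, but smaller than the tolerance
    if significant_difference:
        return "statistical difference"
    if significant_equivalence:
        return "statistical equivalence"
    return "statistical indeterminacy"    # underpowered for either conclusion

# Illustrative values with a 0.10 SD tolerance threshold
print(correspondence_test(0.25, 0.03, 0.23, 0.03, delta=0.10))  # -> statistical equivalence
```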

The correspondence test has several advantages. First, it clearly indicates when the replication design is underpowered for determining replication success or failure. Second, it provides the researcher with a statistical test that explicitly addresses the methodological question of interest in most replication designs: Do the two studies produce the same result with respect to the underlying causal estimand? This removes the ambiguity of failing to reject a null hypothesis of “no difference.” Third, implementing the correspondence test requires the researcher to choose, in advance, a tolerance threshold that separates inconsequential or trivial differences from meaningful differences in causal estimands. Fourth, it provides the researcher with clear decision criteria for assessing replication success. Fifth, the procedure is a straightforward extension of NHST and can easily be implemented in Excel, in common statistical software such as R, or through graphical comparisons of (adjusted) inferential confidence intervals (see Tryon & Lewis, 2008).

Another advantage of the equivalence and correspondence tests is that we can compute the minimum detectable equivalence threshold (MDET), a new measure we propose for assessing the tests’ post-hoc “power” to establish equivalence (Steiner & Wong, 2018). This is particularly helpful whenever an underpowered correspondence test with a given tolerance threshold results in indeterminacy. The MDET is the minimum threshold at which equivalence would have been detected for a given study; it may be computed by gradually increasing the tolerance threshold (e.g., in increments of 0.01 SD) until the correspondence test indicates equivalence instead of indeterminacy. Large MDETs, say larger than 0.3 SD, indicate insufficient power for demonstrating that effect differences are smaller than 0.1 or 0.2 SD. In our guidelines, we will recommend that researchers always report the MDET, independent of the correspondence test’s outcome, and our analysis tools for the correspondence test will automatically compute and report it.
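The incremental search can be sketched as follows. The function name, the z-approximation to the equivalence test, and the numeric inputs are illustrative assumptions rather than the project’s actual tool.

```python
import numpy as np
from scipy import stats

def mdet(est1, se1, est2, se2, alpha=0.05, step=0.01, max_delta=2.0):
    """Minimum detectable equivalence threshold: the smallest tolerance threshold
    at which the equivalence test would indicate equivalence, found by the
    incremental search described above (increments of 0.01 SD)."""
    diff = abs(est1 - est2)
    se_diff = np.sqrt(se1 ** 2 + se2 ** 2)
    z_crit = stats.norm.ppf(1 - alpha)

    delta = step
    while delta <= max_delta:
        # Equivalence (TOST, z-approximation) holds when the observed difference
        # is significantly smaller in magnitude than the threshold delta.
        if (delta - diff) / se_diff > z_crit:
            return delta
        delta += step
    return np.inf  # equivalence not detectable within the searched range

# Illustrative values: similar estimates, but one noisy study -> large MDET
print(mdet(0.25, 0.04, 0.22, 0.15))  # roughly 0.29 SD
```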

Anderson and Maxwell (2017) argue that insufficiently powered replication designs could be an important reason for replication failure. Though formulas and calculators are available for determining the sample sizes required to replicate significance patterns in post-hoc designs (Anderson & Maxwell, 2017; McShane & Böckenholt, 2014), no specialized tools for equivalence and correspondence tests have yet been developed. Given the importance of adequately powered replication efforts, the project will investigate power considerations for assessing replication success and develop corresponding power calculators that can be used to determine minimum sample size requirements for post-hoc and prospectively planned replication designs.
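As a rough illustration of what such a calculator might do (not the project’s eventual tool), the sketch below simulates the probability that the equivalence test establishes equivalence and searches for a minimum sample size. The SE(d) ≈ sqrt(2/n) approximation for a standardized mean difference, the 0.10 SD threshold, and the 80% power target are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

def equivalence_power(se1, se2, delta, true_diff=0.0, alpha=0.05,
                      n_sim=50_000, seed=7):
    """Simulated probability that a TOST-style equivalence test concludes
    equivalence at tolerance `delta`, given each study's standard error and an
    assumed true difference in effects."""
    rng = np.random.default_rng(seed)
    se_diff = np.sqrt(se1 ** 2 + se2 ** 2)
    diff_hat = rng.normal(true_diff, se_diff, n_sim)
    z_crit = stats.norm.ppf(1 - alpha)
    return np.mean((delta - np.abs(diff_hat)) / se_diff > z_crit)

# Rough minimum sample size per arm (in both studies) for 80% power to establish
# equivalence within 0.10 SD, using the crude approximation SE(d) ~ sqrt(2/n)
# for a standardized mean difference with n units per arm.
delta, target_power = 0.10, 0.80
for n in range(100, 5001, 100):
    se = np.sqrt(2 / n)
    if equivalence_power(se, se, delta) >= target_power:
        print(f"about {n} units per arm in each study")
        break
```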


Further Reading

Steiner, P. M., & Wong, V. C. (2018). Assessing Correspondence Between Experimental and Nonexperimental Estimates in Within-Study Comparisons. Evaluation Review, 42(2), 214–247. https://doi.org/10.1177/0193841X18773807


Citations

Anderson, S. F., & Maxwell, S. E. (2016). There’s more than one way to conduct a replication study: Beyond statistical significance. Psychological Methods, 21(1), 1–12. https://doi.org/10.1037/met0000051

Cumming, G., & Maillardet, R. (2006). Confidence Intervals and Replication: Where Will the Next Mean Fall? Psychological Methods, 11(3), 217–227. https://doi.org/10.1037/1082-989X.11.3.217

Hedges, L. V. (1984). Estimation of Effect Size under Nonrandom Sampling: The Effects of Censoring Studies Yielding Statistically Insignificant Mean Differences. Journal of Educational Statistics, 9(1), 61–85. https://doi.org/10.2307/1164832

Howard, G. S., Lau, M. Y., Maxwell, S. E., Venter, A., Lundy, R., & Sweeny, R. M. (2009). Do research literatures give correct answers? Review of General Psychology, 13(2), 116–121. https://doi.org/10.1037/a0015468

Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does “failure to replicate” really mean? American Psychologist, 70(6), 487-498. https://doi.org/10.1037/a0039400

McShane, B. B., & Böckenholt, U. (2014). You Cannot Step Into the Same River Twice: When Power Analyses Are Optimistic. Perspectives on Psychological Science, 9(6), 612–625. https://doi.org/10.1177/1745691614548513

Tryon, W. W. (2001). Evaluating statistical difference, equivalence, and indeterminacy using inferential confidence intervals: An integrated alternative method of conducting null hypothesis statistical tests. Psychological Methods, 6(4), 371-386. https://doi.org/10.1037/1082-989X.6.4.371

Tryon, W. W., & Lewis, C. (2008). An Inferential Confidence Interval Method of Establishing Statistical Equivalence That Corrects Tryon’s (2001) Reduction Factor. Psychological Methods, 13(3), 272–277. https://doi.org/10.1037/a0013158

Valentine, J. C., Biglan, A., Boruch, R. F., Castro, F. G., Collins, L. M., Flay, B. R., … Schinke, S. P. (2011). Replication in Prevention Science. Prevention Science, 12(2), 103–117. https://doi.org/10.1007/s11121-011-0217-6

Wong, V. C., & Steiner, P. M. (2018). Designs of Empirical Evaluations of Nonexperimental Methods in Field Settings. Evaluation Review, 0193841X18778918. https://doi.org/10.1177/0193841X18778918
