When your statistical results are unintuitive or counterintuitive

Successful modeling of a complex data set is part science, part statistical methods, and part experience and common sense (Hosmer, Lemeshow, & Sturdivant, 2013, p. 89).

  1. Individual plots

    a) Always plot the dependent variable against each independent variable first (dependent variable ~ independent variable), following Yiqing Xu’s checklist.

    b) Take a look at outliers—maybe your coding is wrong.

  2. Model assumptions

    a) Normal distribution: Almost all variables in translation and interpreting studies and linguistics (e.g. word length, dimension scores, and information density) have non-normal distributions. Run a Shapiro-Wilk test first; if normality is rejected, use non-parametric statistics, e.g. the Kruskal-Wallis test (instead of one-way ANOVA) and Spearman’s rank correlation (instead of Pearson’s correlation).

    b) Transform your data when needed (see many examples in Coupé, 2019)

    c) Adequate number of samples: Use Fisher’s exact test, not Pearson’s chi-square test, when the expected frequency in any cell is smaller than 5.
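
The assumption checks in step 2 can be sketched in Python with SciPy. This is only an illustration under stated assumptions: the three groups and the 2x2 table are simulated, hypothetical data, and the log transform is just one common choice, not necessarily the one Coupé (2019) would recommend for a given variable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical, simulated data: right-skewed scores for three text groups.
group_a = rng.lognormal(mean=0.0, sigma=0.6, size=40)
group_b = rng.lognormal(mean=0.3, sigma=0.6, size=40)
group_c = rng.lognormal(mean=0.6, sigma=0.6, size=40)

# 2a) Shapiro-Wilk: a small p-value is evidence against normality.
w_stat, p_normal = stats.shapiro(group_a)

# 2a) Non-parametric alternatives to one-way ANOVA and Pearson's r.
h_stat, p_kw = stats.kruskal(group_a, group_b, group_c)
rho, p_rho = stats.spearmanr(group_a, group_b)

# 2b) One possible transform (an assumption here): take logs, then re-check.
w_log, p_log = stats.shapiro(np.log(group_a))

# 2c) Hypothetical 2x2 frequency table with small counts.
table = np.array([[2, 7],
                  [6, 1]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Fall back to Fisher's exact test if any expected frequency is below 5.
if (expected < 5).any():
    odds_ratio, p_assoc = stats.fisher_exact(table)
    test_used = "Fisher's exact test"
else:
    p_assoc = p_chi2
    test_used = "Pearson's chi-square test"

print(f"Shapiro-Wilk p = {p_normal:.4f}; after log transform p = {p_log:.4f}")
print(f"Kruskal-Wallis p = {p_kw:.4f}; Spearman rho = {rho:.3f}")
print(f"{test_used}: p = {p_assoc:.4f}")
```

Note that the decision between the two tests of association rests on the expected frequencies returned by `chi2_contingency`, not on the observed counts.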

  3. Variable selection

    a) Always go for multivariate designs—monofactorial studies “have virtually nothing to contribute to corpus linguistics” (Gries, 2018, p. 295) and linguistics in general. Never go feature shopping!

    b) Choose features that are meaningful in relation to the language system, not the “teddy bears” (cf. Type III error, i.e. giving the right answer to the wrong question).

    c) Watch out for omitted variable bias, which cannot be detected statistically. Gather your independent variables outcome-blind, i.e. before you look at the results!

  4. Model evaluation

    a) Report effect sizes, and aim for “good to excellent” model performance.

    b) Check for multicollinearity with variance inflation factors (VIFs) or, where appropriate, generalised VIFs; remove interactions that are not absolutely necessary.
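
The VIF check in 4b can be sketched with NumPy alone. The sketch relies on the fact that, for numeric predictors, the VIFs are the diagonal of the inverse of the predictors' correlation matrix. The helper name `vif` and the simulated predictors `x1`, `x2`, and `x3` are hypothetical; generalised VIFs for multi-level categorical predictors are not covered here.

```python
import numpy as np


def vif(X):
    """Return the VIF of each column of X (rows = observations).

    VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of the
    inverse of the predictors' correlation matrix.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors: x3 is almost a copy of x1, x2 is unrelated.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)

X = np.column_stack([x1, x2, x3])
vifs = vif(X)

for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

In this simulated case `x1` and `x3` should receive very large VIFs while `x2` stays close to 1, which would suggest dropping or combining one of the two collinear predictors. Which cut-off counts as problematic is a convention that varies between sources.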