Reading reports of controlled experiments critically

TIES542 Principles of programming languages, Spring 2017
Antti-Juhani Kaijanaho

An experiment (in Finnish: koe) is an event where the experimenter attempts to measure the causal influence of some intervention on some outcome measures of interest by trying something with and without the intervention and measuring the difference in outcomes. For example, in the programming language context, an experimenter might try to measure the causal influence of using a statically typed language instead of a dynamically typed language on the time taken by novice programmers to solve algorithmically nontrivial programming tasks; one possible experiment would be to have third-year CS students solve a parser-writing exercise in two languages the experimenter designed themself, one dynamically typed and the other statically typed (but otherwise very similar).

Experiment designs

An experiment typically (but not always) involves human participants (often called, rather insensitively, subjects). Usually the experimenter asks the participants to perform some task and records one or more performance or correctness measures (such as time taken or number of errors) for each participant. An experiment always involves some variation of the task: either different participants are assigned different variations, or every participant is asked to perform the task multiple times under different variations. The goal of the experimenter is to measure the effect of these variations on the outcome measures.

The different ways the task is varied are called treatments. For example, the treatments of the experiment mentioned above are dynamically typed and statically typed. The things the experimenter measures are called outcome measures. Treatments are also sometimes called levels, and the set of treatments is sometimes called the values of the independent variable; the outcome measures are sometimes called dependent variables.

There should always be at least two treatments; a study with only one treatment does not really deserve the label of experiment. Often, the experimenter is interested in the performance of only one treatment (often a new invention of the experimenter themself); in that situation, there should be a control treatment that represents how things have usually been done. If the experiment attempts to demonstrate the superiority of some treatment, a control treatment is essential.

There are two basic experiment designs:

  • A between subjects design assigns each participant one treatment and asks each participant to perform the task (as varied by the treatment) once. Thus, there will be one measurement of each outcome measure per participant.
  • A within subjects (or repeated measures) design assigns every participant all treatments and asks each participant to perform the task many times, with different variations. Thus, there will be multiple measurements of each outcome measure per participant.

Within-subjects experiments have a complication. Often the participants learn from earlier tasks and thus find the later tasks easier. To make it possible to take this learning effect into account, a within-subjects experiment should be counterbalanced by assigning the treatments in different order to different participants. If each possible order is assigned to some participant, then we say that the experiment is completely counterbalanced.
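
For concreteness, here is a minimal sketch (in Python, with hypothetical treatment labels and participant identifiers) of how complete counterbalancing of a two-treatment within-subjects experiment could be arranged; it is an illustration, not a description of any particular study's procedure:

    # Assign every possible treatment order to participants in turn, so that
    # each order occurs (approximately) equally often.
    from itertools import permutations

    treatments = ["dynamic", "static"]       # hypothetical treatment labels
    participants = ["P1", "P2", "P3", "P4"]  # hypothetical participant IDs

    orders = list(permutations(treatments))  # all possible treatment orders
    assignment = {p: orders[i % len(orders)]
                  for i, p in enumerate(participants)}

    for participant, order in assignment.items():
        print(participant, "performs the task in the order", order)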

An experiment is randomized if the assignment of a participant to a treatment (or, in the case of a counterbalanced within-subjects experiment, to a treatment sequence) is made by an unpredictable process such as throwing coins or dice or using a true random number generator. Proper randomization requires that the assignment process is performed exactly once and that the decision whether to use the result of the throw is made before the result is seen. The main reason for randomization is to prevent the experimenter from (unconsciously or deliberately) skewing the assignment, but the use of a random procedure with a known distribution also makes the results of statistical analysis easier to interpret.
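
As an illustration (a sketch in Python, with hypothetical treatment labels and participant identifiers), random assignment in a between-subjects experiment could be done along the following lines; note that this simple coin-flip scheme does not guarantee equally sized groups, which a real experiment might address with blocked randomization:

    # Assign each participant to a treatment by an unpredictable draw,
    # made exactly once per participant and before anyone sees the result.
    import secrets

    treatments = ["dynamic", "static"]       # hypothetical treatment labels
    participants = ["P1", "P2", "P3", "P4"]  # hypothetical participant IDs

    assignment = {p: secrets.choice(treatments) for p in participants}
    print(assignment)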

Note that randomization is about unpredictability. If it is possible for someone to predict the result of the assignment, then the assignment (and the experiment) is not randomized. For example, some researchers consider using the MD5 sum of the participant's name as the assignment criterion to be randomization, but I do not (since it is perfectly predictable and thus systematic).

An experiment is blinded if the participant, the experimenter, or the data analyst cannot reliably determine which treatment the participant was assigned. A report that claims blinding should always specify who (participants, experimenters, data analysts, etc.) was blinded. Sometimes double blinded is used to mean that the research design prevents anyone involved from reliably determining which treatment a particular participant or piece of data was assigned (while preserving enough information to link treatments to results after data analysis). Blinding is very important in fields like medicine but practically impossible in computing (a programmer cannot be ignorant of what tools and methods they use in solving problems).

More complex experiments try to measure the effect of multiple independent variables simultaneously. For example, one might try to measure in the same experiment the effect of choosing between three different syntaxes of if-statements and the effect of using or not using indentation to indicate program structure.[*] Such an experiment is called a factorial experiment, with each independent variable being a separate factor with its own levels. A factorial experiment may be between subjects in one factor and within subjects in another.
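
As a sketch (in Python, with hypothetical factor names and levels), the conditions of such a 3 × 2 factorial experiment can be enumerated as the Cartesian product of the levels of the two factors:

    # Enumerate all combinations of the two factors: 3 x 2 = 6 conditions.
    from itertools import product

    if_syntax = ["syntax_A", "syntax_B", "syntax_C"]  # factor 1: three levels
    indentation = ["indented", "not_indented"]        # factor 2: two levels

    conditions = list(product(if_syntax, indentation))
    print(len(conditions), "conditions:", conditions)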

Data analysis in experiments

Experiments are usually analyzed using inferential statistics. Thus, a critical reader of an experiment report must have some familiarity with statistical inference and its most important pitfalls (especially considering that analysis errors are very common even in peer-reviewed published reports).

For simplicity, I will assume here that there are only two treatments. The first step of statistical inference is to determine a single number representing the observed effect. For example, in a between-subjects comparison of task completion times, this will usually be the difference between the average completion time of the participants using one treatment and the average completion time of the participants using the other treatment. In a within-subjects comparison of task completion times, we instead average the individual differences observed. This effect measure indicates both the magnitude and the direction of the observed effect.
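
The following sketch (in Python, with made-up completion times in minutes, invented purely for illustration) shows the two computations side by side:

    from statistics import mean

    # Between subjects: each participant contributes one measurement.
    times_a = [32.0, 41.5, 28.0, 36.5]   # hypothetical times, treatment A
    times_b = [29.0, 35.0, 30.5, 27.5]   # hypothetical times, treatment B
    effect_between = mean(times_a) - mean(times_b)   # difference of averages

    # Within subjects: each participant contributes one measurement per
    # treatment, so we average the individual differences instead.
    times_a_w = [32.0, 41.5, 28.0, 36.5]   # participant i under treatment A
    times_b_w = [30.0, 38.0, 29.5, 33.0]   # same participant i under treatment B
    effect_within = mean(a - b for a, b in zip(times_a_w, times_b_w))

    print(effect_between, effect_within)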

However, we practically never observe an effect of exactly zero, even when there is no actual effect. It has become standard practice to perform a statistical hypothesis test to accept or reject the hypothesis (often confusingly called the null hypothesis) that there was no real effect. The rules for choosing the proper test are beyond the scope of this course. Common to all such tests is that the experimenter chooses, before performing the test, a so-called \(\alpha\)-threshold, which is a real number between 0 and 1 (\(\alpha = 0.05\) is a common choice). The test itself is reported by giving a so-called \(p\)-value (itself a real number between 0 and 1) along with other test-specific data. We are supposed to accept the null hypothesis if \(p > \alpha\) and reject it if \(p \leq \alpha\).
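
As an illustration only (which test is appropriate depends on the design and the data), a two-sample t-test for the hypothetical between-subjects data above could be run in Python with SciPy as follows:

    from scipy.stats import ttest_ind

    alpha = 0.05                          # chosen before performing the test
    times_a = [32.0, 41.5, 28.0, 36.5]    # hypothetical times, treatment A
    times_b = [29.0, 35.0, 30.5, 27.5]    # hypothetical times, treatment B

    result = ttest_ind(times_a, times_b)
    if result.pvalue <= alpha:
        print("reject the null hypothesis, p =", result.pvalue)
    else:
        print("accept (fail to reject) the null hypothesis, p =", result.pvalue)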

If we reject the null hypothesis at \(\alpha = 0.05\), the result is said to be statistically significant. It is customary to consider a statistically significant result real enough to warrant publication (if it is about a sufficiently interesting issue). However, accepting the null hypothesis is properly understood to say merely that this experiment failed to detect an effect; it does not say anything about the presence or absence of an actual effect.

An alternative to statistical hypothesis testing is to compute a confidence interval. The experimenter chooses a confidence level (a real number between 0 and 1, commonly \(0.95\)) and then computes the interval for that level. The interval is reported by giving its low and high limits (for example "the 95 % confidence interval is 3.4 to 7.6"). It is equivalent to say "we reject the null hypothesis of zero difference at \(\alpha = 0.05\)" and "the 95 % confidence interval for the difference does not include zero".
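
As a sketch (in Python with SciPy, using the same hypothetical data and the usual equal-variance assumptions of the two-sample t procedure), a 95 % confidence interval for the difference of two means could be computed like this:

    import math
    from statistics import mean, variance
    from scipy.stats import t

    times_a = [32.0, 41.5, 28.0, 36.5]    # hypothetical times, treatment A
    times_b = [29.0, 35.0, 30.5, 27.5]    # hypothetical times, treatment B
    n_a, n_b = len(times_a), len(times_b)

    diff = mean(times_a) - mean(times_b)
    # Pooled sample variance and the standard error of the difference.
    pooled_var = ((n_a - 1) * variance(times_a)
                  + (n_b - 1) * variance(times_b)) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))

    level = 0.95
    crit = t.ppf((1 + level) / 2, df=n_a + n_b - 2)   # critical t value
    print("95 % confidence interval:",
          diff - crit * se, "to", diff + crit * se)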

An important difficulty with statistical hypothesis testing and confidence intervals is that they are very easy to misunderstand. Rejecting the null hypothesis at \(\alpha = 0.05\) does not mean that the probability of the null hypothesis being false is \(0.05\). Similarly, the probability of the true difference being in the stated 95 % confidence interval is not \(0.95\). In fact, the frequentist theory of statistics on which these inference techniques are based expressly says that (in most cases) these probabilities are undefined. Instead, the \(\alpha\) and confidence level are error probabilities: if we behave as if the null hypothesis is false every time it is rejected at \(\alpha\), or if we behave as if the confidence interval includes the true value at confidence level \(1-\alpha\), the probability of us being in error is \(\alpha\). The difference is very subtle but also very important.
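
One way to make the frequentist reading concrete is a simulation. The following sketch (in Python with NumPy and SciPy; all numbers invented) repeatedly runs an "experiment" in which the null hypothesis is in fact true, and shows that a test at \(\alpha = 0.05\) falsely rejects it in roughly 5 % of the runs; \(\alpha\) describes the long-run behaviour of the procedure, not the probability that any particular null hypothesis is false:

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    alpha = 0.05
    runs = 10_000
    rejections = 0

    for _ in range(runs):
        a = rng.normal(loc=30.0, scale=5.0, size=20)   # both groups come from
        b = rng.normal(loc=30.0, scale=5.0, size=20)   # the same distribution
        if ttest_ind(a, b).pvalue <= alpha:
            rejections += 1

    print("proportion of false rejections:", rejections / runs)  # about 0.05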

A more serious problem is that statistical inference is very brittle. There are serious claims in the literature that most published research results based on statistical inference are in fact false[*] and that, under current standards of publication in many fields, a researcher using statistical inference can easily manufacture results that are not real but can nevertheless be published in a peer-reviewed journal[*]. These issues, while extremely important for science, are beyond the scope of this course.

Assessing experiment reports

When reading a report of a controlled experiment, ask the following questions:

  1. What are the treatments? Is there a control treatment?
  2. What are the outcome measures used?
  3. Is there a fair start? For example:
    • Are treatments assigned to participants randomly? If not, are there some other safeguards against biased assignment?
    • Is one treatment given a better chance to succeed by the way the tasks or participant instructions are designed?
  4. Is there a fair race? For example:
    • Are participants treated equally regardless of which treatment they were assigned?
  5. Is there a fair finish? For example:
    • Is the statistical analysis free of critical errors, to the extent your statistical training allows you to determine?
    • Are alternative explanations for the results examined?

(The fair start/race/finish metaphor is due to Paul Glasziou. These questions are a paraphrase of Table 19, on page 168, of my doctoral dissertation.[*])

Additionally, consider whether the actual treatments and outcome measures used in the experiment really speak to the question the report claims to investigate. For example, I have read a report that claimed to compare the declarative paradigm and the imperative paradigm but really compared the languages Prolog and C++. It seems to me a stretch to claim that the effect observed was due to the paradigm and not to some other aspect of those two languages.

Similarly, consider how much the results of the experiment may be applicable in the real world, considering how the participants were recruited[*] and how artificial the experimental design is.

Note that this is not a game with scores; you do not rank studies by how many good and bad answers they earn. Instead, these questions are intended to guide your own critical thinking. It is always your own judgment that should ultimately decide whether you trust a report or not.

Further reading

For further discussion of these issues, together with references to the literature, see:

  • Antti-Juhani Kaijanaho: Evidence-based programming language design: a philosophical and methodological exploration. University of Jyväskylä, Jyväskylä studies in computing 222, 2015. JYX
  • Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén: Experimentation in Software Engineering. Springer, 2012. doi:10.1007/978-3-642-29044-2
