D. Fucci, S. Romano, M. T. Baldassarre, D. Caivano, G. Scanniello, B. Thuran, and N. Juristo. A Longitudinal Cohort Study on the Retainment of Test-Driven Development. arXiv e-prints, page arXiv:1807.02971, Jul 2018.
I won’t pretend that I think this is an example of good research. However, it is an example of a research paper that can teach us a lot by critically examining it to understand both the good and the bad parts.
The authors wanted to investigate “to what extent can novice software developers retain TDD and its effects (if any) over a period of five months?” To this end they created a study of third-year students in the “Integration and Testing” course at the University of Bari in Italy. Involvement in the study was voluntary and participants were rewarded with a bonus to their final mark. A total of 30 students took part.
The 30 students were randomly split into two groups. The purpose of the groups was to vary the task order between them and thereby control for differences that would otherwise arise simply from the tasks themselves. The design also fit around the syllabus of the “Integration and Testing” course: the initial measurement occurred after some testing training but before the TDD training, and the second measurement followed the course’s TDD training.
Over approximately 6 months they measured the students’ performance four times by asking them to complete a programming task (BSK = Bowling Scoring Kata, MRA = Mars Rover API, SSH = Spreadsheet logic, GOL = Game of Life), with the direction to use either TDD or YW (“Your Way”, in which the student chooses how to approach testing and building up the solution). Each programming task had an acceptance test suite, which was later used to evaluate each student’s submission.
Using this design the authors intended to test six different hypotheses. The three dependent variables they measured (taken from other studies into TDD) were:
- QLTY (Quality): the average acceptance-test pass rate across the attempted user stories
- PROD (Productivity): the pass rate over the entire acceptance test suite
- TEST: the number of tests the participant wrote
These three dependent variables were combined with two independent variables, Period (an ordinal measure of time) and Technique (YW or TDD), to produce the six hypotheses. Each of the six hypotheses is two-tailed: the authors do not predict whether the value will be higher or lower, only that it will differ.
HN1X. There is no significant effect of Period with respect to X (i.e., QLTY, PROD, or TEST).
HN2X. There is no significant effect of Technique with respect to X (i.e., QLTY, PROD, or TEST).
The authors applied a linear mixed model to test whether Period, Group, and a Period-by-Group interaction explain the variation in each dependent variable (DepVar ~ Period + Group + Period:Group). The tests showed a significant effect of Period on TEST, and of the interaction (Period:Group) on QLTY and PROD. The interaction was not part of any hypothesis, so it is interesting to note but not grounds for rejecting any of the null hypotheses.
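To make the model formula concrete, the sketch below builds the fixed-effects design-matrix columns for DepVar ~ Period + Group + Period:Group. This is purely my own illustration: the paper fit a linear mixed model (whose random-effects structure this omits), and the 0/1 coding of Group and numeric coding of Period here are assumptions.

```python
# Illustration of the fixed-effects terms in the formula
# DepVar ~ Period + Group + Period:Group.
# Not the authors' analysis: they fit a linear mixed model, and the
# 0/1 coding of Group and numeric coding of Period are my assumptions.

def design_row(period, group):
    """One design-matrix row: [intercept, Period, Group, Period:Group]."""
    g = 1 if group == "B" else 0        # Group dummy-coded: A = 0, B = 1
    return [1, period, g, period * g]   # interaction = product of the two

# Each participant is measured at periods P1..P4; one example per group.
for group in ("A", "B"):
    for period in (1, 2, 3, 4):
        print(group, design_row(period, group))
```

The interaction column is the product of the Period and Group columns, which is why a significant Period:Group term means the effect of Period differs between the groups, rather than supporting either stated hypothesis directly.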
The authors conclude:
There is evidence that a significant effect of Period on the number of tests the participants wrote exists. According to the boxplots in Figure 2.c, the significant difference in Period is not due to a deterioration of TEST values for TDD over time—the worst distribution can be observed in P1—therefore, we can conclude that the ability of writing unit tests is retained by developers using TDD.
Since we found a significant effect of Period and in accordance with the results from the descriptive statistics (i.e., there is a clear difference in favor of TDD in P4 with respect to YW in P1 on the same experimental objects), we reject HN2TEST. Therefore, we can conclude that there is a significant effect of Technique on the number of tests the participants wrote.
I’ll be upfront about this. When I first read this paper I thought it was terrible. As I re-read portions of it in order to write this, my opinion has improved slightly. I no longer think that it is a complete waste of space. However, I still believe there are fundamental flaws.
The closest thing I can surmise to an underlying theory here is: once students have been taught TDD, they will be able to remember their training and apply it a number of months later. I find that theory to sit squarely in “so what?” territory. That leaves the study built upon it dangerously without footing for the hypotheses to be tested and the measures to be used.
However, the theory I’ve stated is closely tied to the first Research Question: “To what extent do novice software developers retain TDD and how does this affect their performance?” The concept of “TDD retainment” is used throughout the study, but the authors do not discuss what it means. This is another case of leaving the rest of the study without solid footing.
The second research question that the authors chose uncovers the problems that I believe this research has: “Are there differences between TDD and YW in the external quality of the implemented solutions, developers’ productivity, and number of tests written?” If the answer to this question turns out to be “No!”, then the design of the study cannot answer the first research question, because these three variables are the only things the authors measure.
However, even if the answer is “Yes!” to that second research question, I do not believe that this design actually tells us much about the first question as I do not believe that the variables measured indicate whether or not the participants were actually practicing TDD. The section on Validity will highlight the problems that mean that the study can’t reliably answer the authors’ second question either.
The selection of the students is interesting and a good example of convenience sampling. What would be more convenient than a group of students about to learn the techniques that you want to test? The group of students was probably well suited to this study. I wonder if offering credit in a course for participating a) was ethical and b) influenced the results.
My concern about ethics is that providing course credit to participants could be coercive if students believe that, in order to compete effectively for grades in the course, they need to take part in the research. This can be compensated for by ensuring that non-participants are offered an alternative route to similar extra credit. I would feel better about this if the authors had mentioned sign-off from an ethics board. I should note that I have yet to read a software engineering research article that mentions ethics board approval, so this paper is not an outlier.
The credit could affect the results because students might take part simply to get the credit, understanding that their performance in the study has no bearing on it. Since there is no bearing, they just need to be there; they have no incentive to actually perform the task. That could explain why, for every variable in every period (except TEST in P4), at least one participant achieved a score of 0 (the lowest TEST value in P4 appears to be 1).
It would have been nice to see some descriptive information about the students. The authors report that they asked the participants for background and demographic information, yet they provide none of it, nor the questions that were asked.
The authors used random assignment to split the participants into the two groups, which is the right way to approach this. It removes any chance of systematic bias one way or the other between the groups.
As I said above, I don’t think what the authors measured (QLTY, PROD, TEST) is connected enough to the activity of TDD to answer the first research question or to address “retainment of test-driven development”. TDD is the red-green-refactor cycle, together with using that cycle to build up a solution. The effect of performing that cycle could show up in those variables, but it might not: the same quality and the same tests are absolutely achievable using other methods of development.
Whether or not the participants actually practiced TDD is not established by the design as reported. There does not appear to have been any measure of the participants’ activities as they worked, nor any debrief after they finished to assess whether the manipulation (“Do TDD” versus “Do Your Way”; exactly what the instructions were is not reported) actually had an impact (this is called a manipulation check). Without this, we have to wonder whether the intended manipulation ever happened. Maybe when the students hear “practice TDD” they interpret it to simply mean “write more tests”.
There is another threat from the design: a testing effect caused by the times at which the participants were measured. The authors changed Period from an interval variable (specific dates, with spans of time between them) into an ordinal one (events that happened in an order). This disguises where the testing effect occurs: between periods P1, P2, and P3 there are multiple weeks or months, but P4 is the day after P3. It is reasonable to assume that the measurement at P3 reminded the participants of what they would be doing, so that at P4 (the next day) they were all warmed up and ready to put in a bit more effort. Imagine them coming out of the room after the P3 measurement and psyching each other up. The next day they come back, put in a bit more effort, and write a few more tests.
The last threat that I’ll mention is maturation. The authors attempted to explain this one away by saying:
Finally, control over subject maturation was checked by making sure that the students attended the same courses between the first observation and the last one.
The authors’ dismissal of this threat shows a lack of understanding of the threat itself. Maturation happens simply because people change over time. Taking the same courses does not control maturation. Maybe those courses continued to teach TDD and so the ongoing education could have set them up for the results measured.
I believe that maturation of the students as software developers is a reasonable alternative explanation for the increase in TEST over time. As they learned more about computer science, they learned more about what should be tested and therefore they produced more tests. Combine that with the priming effect of the short time elapsed between P3 and P4 and there is a plausible explanation for the results without any “retainment” of TDD needed.
Because of the flaws in the design of this study, or at least in its reporting, I don’t think anything can be concluded about TDD from it. However, I still found value in reading this research article. It is a useful example of how research can seem correct and plausible at first glance, yet on closer analysis have a number of flaws that undermine its validity.