Paper Critique: On the diffuseness and the impact on maintainability of code smells

Palomba, F., Bavota, G., Penta, M. et al. Empir Software Eng (2018) 23: 1188. https://doi.org/10.1007/s10664-017-9535-z

Suppose you are working on some chunk of code. You notice that the logic is a bit convoluted, some of the class fields are public, there seem to be more methods on the class than make sense for the concept, and the particular method you are looking at is absurdly long. You sit at your keyboard and your training about code smells says that you need to be a responsible software developer and address those smells. Then your other training, about engineering being about tradeoffs, kicks in and you wonder: which of these matter? What impact do they have?

Overview of the Study

A fair warning up front: this is a very large study, with many different analyses performed and explorations reported. I’ve had to leave a lot out of my summary.

The study investigates the following three questions:

  • RQ1: What is the diffuseness of code smells in software systems? This is a preliminary research question aiming at assessing to what extent software systems are affected by code smells.
  • RQ2: To what extent do classes affected by code smells exhibit a different level of change- and fault-proneness with respect to non-smelly classes? Previous work (Khomh et al. 2012) found that classes affected by at least one smell have a higher chance of being change- and fault-prone than non-smelly classes. In this work we are interested in measuring the change- and fault-proneness magnitude of such classes, in terms of number of changes and of bug fixes.
  • RQ3: To what extent do change- and fault-proneness of classes vary as a consequence of code smell introduction and removal? This research question investigates whether the change- and fault-proneness of a class increases when a smell is introduced, and whether it decreases when the smell is removed. Such an analysis is of paramount importance because a class may be intrinsically change-prone (and also fault-prone) regardless of whether it is affected by code smells.

In order to investigate these questions, the authors created a tool that loosely identifies 13 different smells in Java source code. They then manually reviewed the smells identified by the tool to determine the “true” smells in the code under investigation. This is a reasonable approach, in my opinion, as they are interested in what a developer would perceive as a smell and not what a tool would.

The tool outputs a list of candidate code components (i.e., classes or methods) potentially exhibiting a smell. Then, we manually validated the candidate code smells suggested by the tool. The validation was performed by two of the authors who individually analyzed and classified as true or false positives all candidate code smells. Finally, they performed an open discussion to resolve possible conflicts and reach a consensus on the detected code smells. To ensure high recall, our detection tool uses very simple rules that overestimate the presence of code smells.
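
The paper does not reproduce the detection rules themselves, but a deliberately over-approximating rule of the kind described could be as simple as the following sketch, which flags any Java method longer than a fixed number of lines. The threshold and the crude brace counting are my own assumptions, not the authors’ tool.

```python
import re

# Assumed threshold: any method body longer than this many lines becomes a
# Long Method *candidate*; false positives are expected and are weeded out
# by the manual review step described above.
LONG_METHOD_THRESHOLD = 50

METHOD_SIGNATURE = re.compile(
    r'^\s*(public|protected|private|static|\s)+[\w<>\[\]]+\s+(\w+)\s*\([^)]*\)\s*\{')

def long_method_candidates(java_source: str):
    """Yield (method_name, line_count) for suspiciously long methods."""
    lines = java_source.splitlines()
    for i, line in enumerate(lines):
        match = METHOD_SIGNATURE.match(line)
        if not match:
            continue
        # Crude brace counting to find the end of the method body.
        depth, end = 0, i
        for j in range(i, len(lines)):
            depth += lines[j].count('{') - lines[j].count('}')
            if depth == 0 and j > i:
                end = j
                break
        if end - i + 1 > LONG_METHOD_THRESHOLD:
            yield match.group(2), end - i + 1
```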

The code reviewed was from a non-random sample of 30 open source Java projects. These projects were chosen to cover a range of sizes (measured in KLOC), domains, ages, and development communities. Some examples of the projects used are Eclipse Core, Hadoop, HBase, JEdit, Sax, and Wicket. Strangely, the reported size of the Hibernate project is 5 classes. Other project sizes seem reasonable, but this outlier is not mentioned in the paper. It may simply be an artifact of using a specific “superproject” repository, but it puts some doubt in my mind as to the validity of their sampling method.

In total the study covered 395 different releases of these 30 projects. The analysis was largely performed over the released code and not over the intermediate states as each commit was made. The analysis of fault-proneness did look at individual commits in order to determine which release introduced or removed a bug. This was done with the SZZ algorithm (Śliwerski et al. 2005). Commits were classified as bug fixes by looking for references to issue trackers in the commit message and then verifying that the issue was indeed a bug report.
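
The paper describes this classification only at a high level; a minimal sketch of the commit-message side of it might look like the following, where the JIRA-style issue pattern and the pre-collected set of bug-report issue IDs are my own assumptions.

```python
import re

# Assumed pattern for JIRA-style references such as "DERBY-1234" or "HBASE-567".
ISSUE_REFERENCE = re.compile(r'\b([A-Z][A-Z0-9]+-\d+)\b')

def referenced_issues(commit_message: str) -> list[str]:
    """Return the issue identifiers mentioned in a commit message."""
    return ISSUE_REFERENCE.findall(commit_message)

def is_bug_fix(commit_message: str, bug_issue_ids: set[str]) -> bool:
    """A commit counts as a bug fix only if it references an issue that the
    tracker classifies as a bug report (the verification step in the study)."""
    return any(issue in bug_issue_ids for issue in referenced_issues(commit_message))

# Hypothetical usage:
bugs = {"DERBY-1234"}
print(is_bug_fix("Fix NPE in the parser (DERBY-1234)", bugs))  # True
print(is_bug_fix("DERBY-2000: add a new feature", bugs))       # False
```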

Figure: Description of Code Smells

Diffuseness of code smells

Diffuseness is defined by the authors as “the percentage of code components in a system affected by at least one instance of the code smell type.” I’ll note right now that I find the term “diffuseness” very hard to work with in this context, which might have also affected the authors as their own analysis of diffuseness diverged quite often from their definition. However, the intention of this investigation is more to get a “lay of the land” view of code smells than to draw any particular conclusions.
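
Taken literally, the definition is straightforward to operationalise for a single release; here is a minimal sketch under my own assumptions about how the components and detected smells are represented.

```python
def diffuseness(components, smell_instances, smell_type):
    """Percentage of code components affected by at least one instance of the
    given smell type (the paper's definition, applied to one release).

    components:      iterable of component identifiers (classes or methods)
    smell_instances: iterable of (component_id, smell_type) pairs
    """
    components = set(components)
    affected = {comp for comp, smell in smell_instances
                if smell == smell_type and comp in components}
    return 100.0 * len(affected) / len(components) if components else 0.0

# Hypothetical example: two of four components contain a Long Method.
comps = ["A", "B", "C", "D"]
smells = [("A", "Long Method"), ("A", "God Class"), ("C", "Long Method")]
print(diffuseness(comps, smells, "Long Method"))  # 50.0
```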

Code smells were unequally diffused throughout the releases that were studied. The Long Method smell seemed to be everywhere (at least one instance in 84% of the releases studied). In one case (Apache Derby 10.1) there were so many that they looked more closely and found that it was due to parsing code (org.apache.derby.impl.sql.compile). Long Methods connected to parsing were found in other cases as well.

Overall they classified 5 smells as highly diffused, 4 as medium diffused, and 4 as poorly diffused. Unfortunately the authors do not define what those three levels mean nor discuss the actual method for analysing diffuseness.

Figure: Diffuseness of code smells

Change- and fault-proneness of classes

In order to investigate the change-proneness question, the authors produced a dataset that counted the number of changes (count of commits) to a class in a release of a project and recorded whether the class was affected by a smell in that release. Fault-proneness was investigated with a similar dataset, but instead of the number of changes to the class, the number of bug-fixing changes to the class was considered. They then used a Mann-Whitney test to check for a statistically significant difference between the smelly and non-smelly classes across all releases.
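
As I read it, the comparison boils down to something like the following sketch; scipy is my choice of tooling and the counts are invented, but it shows the shape of the test as applied to the pooled per-class data.

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-class change counts pooled across all releases.
changes_smelly = [12, 7, 30, 5, 22, 18, 9]    # classes with at least one smell
changes_clean = [3, 1, 0, 4, 2, 6, 1, 0, 5]   # classes with no smell

# One-sided test: are smelly classes changed more often?
statistic, p_value = mannwhitneyu(changes_smelly, changes_clean,
                                  alternative='greater')
print(f"U = {statistic}, p = {p_value:.4f}")
```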

They report a statistically significant difference between smelly and non-smelly classes for both change- and fault-proneness (p<0.001 for both tests). See the Rigour section below for more analysis of these results.

Figure: Change-proneness and fault-proneness of smelly vs. non-smelly classes

Summary for RQ2. Our results confirm the findings by Khomh et al (2012): Classes affected by code smells have a statistically significant higher change- (large effect size) and fault- (medium effect size) proneness with respect to classes not affected by code smells. Also, we observed a very clear trend indicating that the higher the number of smells affecting a class the higher its change- and fault-proneness.

Change- and fault-proneness from introduction/removal of smells

The authors used the SZZ algorithm to determine when a particular code feature that was changed for a bug fix had been introduced. With this information they studied how the change- and fault-proneness of the code varied as it transitioned between being smelly and not-smelly.

The authors found that some code smells increased change-proneness (God Class, for instance) while others reduced change-proneness (Lazy Class). The differences between when classes were smelly as compared to not-smelly were evident by simple inspection of the plots. Fault-proneness showed much less effect, except for the case of Feature Envy. The authors reason about this as follows.

For all other smells we did not observe any strong difference in the fault-proneness of the classes when comparing the time periods during which they were affected and not affected by code smells. While this result might seem a contradiction with respect to what observed in RQ2 and in the previous study by Khomh et al. (2012), our interpretation is that classes that were fault-prone in the past will still continue to be fault-prone, even if a smell was removed. Moreover, since a smell removal requires a change to the code, it can have side effects like any other change, thus possibly affecting the fault-proneness independently of the smell. This is also in agreement with previous studies that used the past fault-proneness history of classes to predict their future faults (Ostrand et al. 2005). In essence, there seems to be no direct cause-effect relationships between the presence of code smells and the class fault-proneness. Rather, those are two different negative phenomena that tend to occur in some classes of a software project.

The conclusion for RQ3 is that bugs and smells co-occur, but do not appear to have a causal relationship.

Figure: Summary of Results

Critique

This was a wide-ranging correlational study that included three different questions, two of which had at least two parts. Covering at least five different, if related, aspects of code smells was a large undertaking, and in my opinion it would have been better taken on in smaller chunks. I think that by putting all of these questions together, some important aspects are missing that could have greatly improved the value of the work the authors put in.

Theory

Right out of the gate, there is no clear theory that the authors are drawing from. This leaves the questions and the definitions hanging in the air without a strong foundation. I believe that I could make up my own theories around code smells and code maintainability that explain the research questions; however, I would like to see an explicit, testable theory referenced as the impetus for their research questions. The basis for the questions and the research ends up being a much less rigorous “more data is better” analysis. This shows up in the Introduction when they state “this paper aims at corroborating previous empirical research on the impact of code smells by analyzing their diffuseness and effect on change- and fault-proneness on a large set of software projects.” I found the data and report interesting, but it left me wondering how these results are supposed to feed into a larger understanding of the world. Maybe there is a clear theory underlying the existing research and it was elided because the authors took it as a given.

Sampling

The sampling of systems for analysis was conducted in a non-random manner. From the description of the selection criteria, it appears that they used a Quota Sampling technique.

The choice of the subject systems was driven by the will to consider systems having different size (ranging from 0.4 to 868 KLOCs), belonging to different application domains (modeling tools, parsers, IDEs, IR-engines, etc.), developed by different open source communities (Apache, Eclipse, etc.), and having different lifetime (from 1 to 19 years).

This method of sampling can create bias in the systems analysed. As the authors did not report how they developed their sampling frame, it is impossible to tell what kinds of bias might be present in their method. However, a quick inspection of the systems included in the analysis appears to show a bias toward systems that involve parsers (2 XML parsers, 8 database-related systems, Eclipse).

Such a bias could explain the distribution of smells that they report, specifically the large number of Long Method cases. The authors themselves point out that parsers are likely to suffer from this smell:

While we cannot draw any clear conclusion based on the manual analysis of these two systems, our feeling is that the inherent complexity of such parsing methods makes it difficult for developers to (i) write the code in a more concise way to avoid Long Method code smells, or (ii) remove the smell, for instance by applying extract method refactoring.

Validity

I am left wondering what to do with the section about diffuseness of code smells. The definition of the variable of diffuseness is given very specifically as “the percentage of code components in a system affected by at least one instance of the code smell type.” The authors then provide an analysis that does not appear to be based on this definition. Instead they offer three alternatives, all of which span multiple systems by aggregating by smell type rather than by system. They also leave me wondering how they operationalised diffuseness when the definition is applied across multiple releases of a system. They then changed their ratio measure of diffuseness into an ordinal measure by translating it to High/Medium/Low. There is no explanation of this transformation, why it was done, or what the reader should make of it.

The handling of the change- and fault-proneness data appeared to be much better. There was enough information given that I could work out what data their analysis might have been based on, although it took a little back and forth through the paper, and some thinking on my part, to work out the exact variables they used.

The non-statistical analysis of the change in change- and fault-proneness (Research Question 3) with the introduction and removal of code smells seemed reasonable given the framing of the study as exploratory. The authors present boxplots of the data and give their interpretation and exploration of it. I found this both interesting to read and enlightening. The presentation and exploration of the change- and fault-proneness data for Research Question 2 was likewise interesting and enlightening.

There is, however, a possible problem with the statistical significance analysis for RQ2 (and possibly for the analysis done for RQ3). The authors use the Mann-Whitney test to check for a significant difference between smelly and non-smelly classes across all releases of all systems that they studied. They applied this test for both number of changes and number of faults. I believe that the Mann-Whitney test is inapplicable to this data as it requires independent observations.

The dependence of observations becomes apparent if we clarify the variables being measured and the theory that surrounds those variables. In this case, I believe that the variables for observations in the change-proneness analysis (the fault-proneness analysis is similar) are the following (name, level of measure, description); a sketch of one such observation follows the list:

  • system (nominal): The identifier of the system (e.g. HBase)
  • release (ordinal): The identifier of the release of the system (e.g. 10.1)
  • class (nominal): The identifier of the class in the release of the system (e.g. org.apache.cassandra.service.StorageService)
  • number of changes (ratio): The count of commits affecting the class in the release of the system
  • smelly (nominal): True if the class exhibits a code smell in the release, False otherwise
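
For concreteness, one observation in such a dataset might be represented as follows; the field names and example values are my own, not taken from the paper or its replication package.

```python
from dataclasses import dataclass

@dataclass
class ClassObservation:
    """One row of the change-proneness dataset: a single class in a single
    release of a single system."""
    system: str             # nominal, e.g. "HBase"
    release: str            # ordinal within a system, e.g. "10.1"
    class_name: str         # nominal, fully qualified class name
    number_of_changes: int  # ratio, commits touching the class in this release
    smelly: bool            # nominal, True if the class exhibits >= 1 smell

# Made-up example observation:
obs = ClassObservation("HBase", "0.94", "org.example.SomeClass", 17, True)
```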

A basic theory that holds these together is that a system is composed of a number of different classes, which interact to provide some sort of service. The system is modified over time and at specific points during those modifications the state of the system is frozen to create a release. The classes of the system come into being at a specific point, are modified, and may, at some point, be removed from the system. During their lifespan they appear in one or more releases and over time will or will not exhibit code smells as modifications are made. The classes of a system are related to each other and exhibit interactions, which is known as the design of the system.

Given this theory and these measures there are at least two dependencies that show up: (1) the smelly variable for a class in a release is going to depend on the smelly value for the same class in previous releases (the classes are not completely rewritten each release) and (2) the smelly variable for a class in a release could (this is harder to see) depend on the smelly variables of other classes in the same release (the structure of any individual class depends on the design of the system, which determines what is in all of the other related classes). The consequence of (1) is that the data would be better treated as repeated measures (and therefore the Friedman Test might be better suited). I’m not entirely sure of the consequence of (2) for how to treat the data, as it takes me beyond my understanding of statistical analysis, but I assume it means that much more complex analysis is needed.
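
To illustrate the mechanics of the repeated-measures alternative, here is a minimal sketch using the Friedman test on invented change counts for the same five classes tracked across three releases. A real analysis would still have to handle classes that appear or disappear between releases, and to frame the smelly versus non-smelly comparison itself, so this only shows how the test is called.

```python
from scipy.stats import friedmanchisquare

# Hypothetical change counts for the *same* five classes in three releases;
# the i-th position in each list refers to the same class.
release_1 = [4, 0, 12, 3, 7]
release_2 = [6, 1, 15, 2, 9]
release_3 = [5, 0, 20, 4, 8]

statistic, p_value = friedmanchisquare(release_1, release_2, release_3)
print(f"chi-square = {statistic:.2f}, p = {p_value:.4f}")
```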

Conclusion

I found the investigation and the exploration that the authors performed to be interesting, and I think I can see how I could make use of it. When I read past the flaws I also appreciate the amount of work that the authors put into this. They should be applauded for putting it together. Making the dataset available to others could lead to the testing of interesting hypotheses that would otherwise not be pursued, given the amount of effort required to obtain this much data on a wide range of software projects.

The authors’ conclusions reinforce my belief that code smells are not a problem in and of themselves. Code exhibits smells because of either essential complexity or accidental complexity. Unfortunately, the smells themselves cannot tell me which kind of complexity I am dealing with, so I cannot simply apply refactorings blindly to remove them. A code smell should instead be seen as an invitation to engage more completely with the domain problem that the code is there to solve, and to find the underlying design mismatch or complexity that needs to be addressed.
