Data files used in our paper

Paper title: Empirical Study of Fault Introduction Focusing on the Similarity among Local Variable Names
Conference: QuASoQ2019
Status: published as
Hirohisa Aman, Sousuke Amasaki, Tomoyuki Yokogawa and Minoru Kawahara,
``Empirical Study of Fault Introduction Focusing on the Similarity among Local Variable Names,''
Proc. 7th International Workshop on Quantitative Approaches to Software Quality, pp. 3--11, Dec. 2019.
Data file description
Data for analysis
(tab-separated values (TSV) format)
These TSV files present the data on local variables in methods.
Each file corresponds to one project.
Each line of a TSV file corresponds to one method, and its columns are as follows.
  • "Similarity": The highest Levenshtein similarity (HLS) in the method; "-1" signifies that there is no pair of local variables in the method because the number of local variables is less than 2.
  • "n_Vars": The number of local variables in the method.
  • "LOC": The lines of code of the method.
  • "CC": The McCabe metric value (cyclomatic complexity) of the method.
  • "Historage_hash": The SHA (hash code) of the method on the Historage repository.
  • "Method": the file path corresponding to the method on the Historage repository.
  • "Vars": the list of local variables in which the variables are separated by "/"; "---" means that there is no variable in the method.
  • "BugIntroduction": whether a bug introduction occurred (1) or not (0).
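The column layout above can be read with a few lines of Python. This is only a minimal sketch: it assumes each TSV file starts with a header row that uses the column names listed above, and the two sample rows (hashes, paths, variable names) are invented for illustration and are not taken from the data files.

```python
import csv
import io

# Invented sample in the documented column layout (not real data).
sample = (
    "Similarity\tn_Vars\tLOC\tCC\tHistorage_hash\tMethod\tVars\tBugIntroduction\n"
    "0.5\t2\t10\t3\tabc123\tpath/to/method_a\tidx/idy\t1\n"
    "-1\t0\t4\t1\tdef456\tpath/to/method_b\t---\t0\n"
)

def read_methods(tsv_text):
    """Parse TSV text into a list of dicts, one per method."""
    # Assumption: the first line is a header row with the documented names.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    methods = []
    for row in reader:
        similarity = float(row["Similarity"])
        # "---" marks a method without local variables.
        names = [] if row["Vars"] == "---" else row["Vars"].split("/")
        methods.append({
            # -1 means "no pair of local variables"; map it to None.
            "similarity": None if similarity == -1 else similarity,
            "n_vars": int(row["n_Vars"]),
            "loc": int(row["LOC"]),
            "cc": int(row["CC"]),
            "hash": row["Historage_hash"],
            "method": row["Method"],
            "vars": names,
            "bug_introduction": row["BugIntroduction"] == "1",
        })
    return methods

methods = read_methods(sample)
```

Mapping "-1" to None keeps the sentinel from being mistaken for a real similarity value when computing statistics over the "Similarity" column.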
Performance evaluation results (for RQ3)
(comma-separated values (CSV) format)
These CSV files present the performance evaluation results (for RQ3).
Each file corresponds to one project.
Each line of a CSV file corresponds to one iteration of the performance evaluation.
  • "itr": the iteration number from 1 to 100.
  • "type": whether the model is the baseline RF model ("base") or the RF+classification model ("comp").
  • "recall": The recall value.
  • "precision": The precision value.
  • "fval": The F value.
  • "diff.r": the difference in recall from the "base" model with the same iteration number.
  • "diff.p": the difference in precision from the "base" model with the same iteration number.
  • "diff.f": the difference in the F value from the "base" model with the same iteration number.
  • "diff.r.rate": (diff.r) / (the recall of the "base" model), with the same iteration number.
  • "diff.p.rate": (diff.p) / (the precision of the "base" model), with the same iteration number.
  • "diff.f.rate": (diff.f) / (the F value of the "base" model), with the same iteration number.
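The relationship between the "diff" columns and the raw metrics can be sketched as follows. The sketch rests on several assumptions: the CSV files have a header row with the column names above, the difference is taken as ("comp" value) minus ("base" value), and the numbers in the sample are arbitrary and not taken from the data files.

```python
import csv
import io

# (column name, short suffix used in the diff.* column names)
METRICS = (("recall", "r"), ("precision", "p"), ("fval", "f"))

# Invented sample containing only the columns needed here (not real data).
sample_csv = (
    "itr,type,recall,precision,fval\n"
    "1,base,0.5,0.25,0.5\n"
    "1,comp,0.75,0.5,0.375\n"
)

def recompute_diffs(csv_text):
    """Recompute the diff.* columns for every "comp" row, keyed by iteration."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Index the baseline rows by iteration number.
    base = {row["itr"]: row for row in rows if row["type"] == "base"}
    result = {}
    for row in rows:
        if row["type"] != "comp":
            continue
        ref = base[row["itr"]]
        diffs = {}
        for metric, short in METRICS:
            b = float(ref[metric])
            c = float(row[metric])
            # Assumption: difference is comp minus base.
            diffs[f"diff.{short}"] = c - b
            diffs[f"diff.{short}.rate"] = (c - b) / b
        result[int(row["itr"])] = diffs
    return result

diffs = recompute_diffs(sample_csv)
```

In this sketch a "base" value of 0 raises ZeroDivisionError; how the published files represent that case is not stated here.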
Historage archives
(tar.gz (TGZ) format)
These TGZ files are archives of the Historage repositories.
SZZ results
(JSON format)
These JSON files are the results of running SZZ Unleashed. Each JSON file presents pairs of a fault-fixing commit hash and a fault-introducing commit hash. Note that these hash codes are the values used in the original Git repositories, not in the (converted) Historage repositories.
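The pairs can be grouped by fixing commit as sketched below. The exact JSON layout is an assumption: the sketch expects a top-level array of two-element arrays of the form [fault-fixing hash, fault-introducing hash], and the short hashes in the sample are placeholders rather than real commits.

```python
import json
from collections import defaultdict

# Placeholder hashes in the assumed layout (not real commits).
sample_json = '[["fix1", "intro1"], ["fix1", "intro2"], ["fix2", "intro3"]]'

def introducers_by_fix(json_text):
    """Map each fault-fixing commit hash to its fault-introducing hashes."""
    # Assumption: a list of [fix_hash, introducer_hash] pairs.
    pairs = json.loads(json_text)
    grouped = defaultdict(set)
    for fix_hash, intro_hash in pairs:
        grouped[fix_hash].add(intro_hash)
    return dict(grouped)

grouped = introducers_by_fix(sample_json)
```

Because these hashes refer to the original Git repositories, they cannot be compared directly with the "Historage_hash" column of the TSV files.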
(C) 2019 Hirohisa Aman <aman (at) ehime-u.ac.jp>