Data files used in our paper

Paper title: A Doc2Vec-Based Assessment of Comments and Its Application to Change-Prone Method Analysis
Conference: APSEC2018
Status: published as
Hirohisa Aman, Sousuke Amasaki, Tomoyuki Yokogawa and Minoru Kawahara,
``A Doc2Vec-Based Assessment of Comments and Its Application to Change-Prone Method Analysis,''
Proc. 25th Asia-Pacific Software Engineering Conference (APSEC 2018), pp.643--647, Dec. 2018.
Data file description
Data for analysis
(tab-separated values (TSV) format)
These TSV files present data on commented methods.
Each file corresponds to one project.
Each line of a TSV file corresponds to one method, with the following columns.
  • "file": Java source file path
  • "begin_line": the line number on which the method's body starts
  • "end_line": the line number on which the method's body ends
  • "signature": the method's signature
  • "change": the number of changes made to the method after the release of the analyzed version
  • "sim": the cosine similarity computed for the method
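
As a minimal sketch of how one of these TSV files could be read, the Python snippet below parses each line into a dictionary keyed by the six column names and converts the numeric columns. Whether the files begin with a header row is not stated here, so the code skips one if it is present. The function name `load_methods` and the sample row are hypothetical and serve only as illustration.

```python
import csv
import io

# Column names as listed in the description above.
COLUMNS = ["file", "begin_line", "end_line", "signature", "change", "sim"]


def load_methods(stream):
    """Parse TSV text from a file-like object into a list of dicts, one per method."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    methods = []
    for row in reader:
        # Skip blank lines and an optional header row (assumption: a header,
        # if any, repeats the column names exactly).
        if not row or row == COLUMNS:
            continue
        record = dict(zip(COLUMNS, row))
        record["begin_line"] = int(record["begin_line"])
        record["end_line"] = int(record["end_line"])
        record["change"] = int(record["change"])
        record["sim"] = float(record["sim"])
        methods.append(record)
    return methods


# Hypothetical sample row, not taken from the actual data files.
sample = "src/Foo.java\t10\t20\tvoid foo(int x)\t3\t0.42\n"
methods = load_methods(io.StringIO(sample))

# Example query: methods that were changed at least once after the release.
changed = [m for m in methods if m["change"] > 0]
```

For a real file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`; the encoding is an assumption.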
Doc2Vec models
This ZIP file contains 5 directories whose names correspond to project names;
each directory has the following 3 files.
  • doc2vec.model: the Doc2Vec model file created by
  • target_method_contents.txt: a text file in which each line holds the content of one method, with stop words and symbols excluded; this file is used for Doc2Vec training by the above Python script.
  • target_method_original_contents.txt: the original content of the methods; target_method_contents.txt is generated from this file by removing stop words and symbols.
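
The kind of preprocessing that turns the original contents into the cleaned contents can be sketched roughly as follows. This is an illustration only, not the authors' procedure: the stop-word list, the definition of "symbol" (here, any non-alphanumeric character), and the function name `clean_method_content` are all assumptions.

```python
import re

# Hypothetical stop-word list; the list actually used is not given here.
STOP_WORDS = {"a", "an", "the", "of", "to", "is"}


def clean_method_content(text):
    """Remove symbols and stop words from one method's text."""
    # Assumption: a "symbol" is any character other than an ASCII letter or digit.
    no_symbols = re.sub(r"[^0-9A-Za-z]+", " ", text)
    # Drop stop words (case-insensitive) and rejoin with single spaces.
    words = [w for w in no_symbols.split() if w.lower() not in STOP_WORDS]
    return " ".join(words)


# Hypothetical method text, not taken from the actual data files.
original = "/* Returns the size of the list. */ int size() { return count; }"
cleaned = clean_method_content(original)
print(cleaned)
```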
The cosine similarities in the above analysis data are computed with the Doc2Vec model of the corresponding project.
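
For reference, the cosine similarity of two vectors u and v is (u · v) / (|u| |v|). The plain-Python sketch below computes it. It is not the authors' script, and the example vectors are hypothetical stand-ins for vectors that a Doc2Vec model would produce. Which pair of texts is compared for each method is described in the paper, not in this file.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: similarity with a zero vector is 0.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical vectors; real ones would be inferred by a Doc2Vec model.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]
similarity = cosine_similarity(vec_a, vec_b)
```

If the model files were produced with gensim (an assumption; this file does not name the tool), vectors could be obtained by loading a model with `Doc2Vec.load` and calling `infer_vector` on a list of tokens. Models saved by an older library version may not load in a newer one.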
(C) 2018 Hirohisa Aman <aman (at)>