Data files used in our paper

Paper title: A Doc2Vec-Based Assessment of Comments and Its Application to Change-Prone Method Analysis
Conference: APSEC2018
Status: published as
Hirohisa Aman, Sousuke Amasaki, Tomoyuki Yokogawa and Minoru Kawahara,
``A Doc2Vec-Based Assessment of Comments and Its Application to Change-Prone Method Analysis,''
Proc. 25th Asia-Pacific Software Engineering Conference (APSEC 2018), pp.643--647, Dec. 2018.
Data file description
Data for analysis
(tab-separated values (TSV) format)
These TSV files present the data of commented methods.
Each file corresponds to one project.
Each line of a TSV file corresponds to one method, and its columns are as follows.
  • "file": Java source file path
  • "begin_line": the line number on which the method's body starts
  • "end_line": the line number on which the method's body ends
  • "signature": the method's signature
  • "change": the number of changes that occurred in the method after the release of the analyzed version
  • "sim": the cosine similarity computed for the method
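As a rough illustration of how one of these TSV files might be read, the following Python sketch parses the six columns into typed values. It assumes that each file starts with a header line carrying the column names listed above; that assumption, the sample rows, and the helper name read_methods are all hypothetical and are not taken from the actual data files.

```python
import csv
import io

# Hypothetical sample rows for illustration only; the real files'
# header line and values may differ.
SAMPLE_TSV = (
    "file\tbegin_line\tend_line\tsignature\tchange\tsim\n"
    "src/Foo.java\t10\t25\tvoid foo()\t3\t0.42\n"
    "src/Bar.java\t40\t48\tint bar(int)\t0\t0.87\n"
)


def read_methods(stream):
    """Parse tab-separated rows into dicts, converting the numeric columns."""
    # QUOTE_NONE keeps quote characters inside signatures untouched.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    methods = []
    for row in reader:
        methods.append({
            "file": row["file"],
            "begin_line": int(row["begin_line"]),
            "end_line": int(row["end_line"]),
            "signature": row["signature"],
            "change": int(row["change"]),
            "sim": float(row["sim"]),
        })
    return methods


methods = read_methods(io.StringIO(SAMPLE_TSV))

# Example query: methods that were changed at least once after the release.
changed = [m for m in methods if m["change"] > 0]
```

To read a real file, an opened file object can be passed in place of the io.StringIO wrapper, for example open(path, newline="", encoding="utf-8").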
Doc2Vec models doc2vec_models.zip
(108MB)
This ZIP file contains 5 directories whose names correspond to project names;
each directory has the following 3 files.
  • doc2vec.model: the Doc2Vec model file created by doc2vec_learning.py
  • target_method_contents.txt: a text file in which each line holds the content of one method, with stop words and symbols removed; this file is used for the Doc2Vec learning by the above Python script.
  • target_method_original_contents.txt: the original content of the methods; target_method_contents.txt is generated from this file by removing stop words and symbols.
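The kind of preprocessing that turns a line of target_method_original_contents.txt into a line of target_method_contents.txt can be sketched as follows. This is only an illustration: the stop word list, the definition of a symbol (here, any character that is not an ASCII letter or digit), and the function name clean_line are assumptions, not the rules used to produce the distributed files.

```python
import re

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "in"}


def clean_line(text):
    """Remove symbols and stop words from one line of method content."""
    # Treat every run of non-alphanumeric characters as a symbol
    # and replace it with a single space.
    no_symbols = re.sub(r"[^A-Za-z0-9]+", " ", text)
    # Drop stop words, compared case-insensitively.
    tokens = [t for t in no_symbols.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)


cleaned = clean_line("// Returns the size of the list. int size() { return n; }")
```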
The cosine similarities in the above analysis data are computed by doc2vec_sim.py with the model corresponding to the project.
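For reference, the cosine similarity of two vectors is their dot product divided by the product of their Euclidean norms. The sketch below shows only that standard definition in plain Python; it is not the code of doc2vec_sim.py, the function name cosine_similarity is hypothetical, and the input vectors are made-up values rather than vectors inferred from a Doc2Vec model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up vectors: parallel vectors give 1, orthogonal vectors give 0.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```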
(C) 2018 Hirohisa Aman <aman (at) ehime-u.ac.jp>