Determining logical project structure from commit logs

In a bored five minutes at work I threw the following together: logical source file groupings in the Scala repo.

The largest cluster is clearly noisy and random. I more or less expected that. But the small and medium ones often make a lot of sense.

The basic technique is straightforward. A trivial script scrapes the SVN logs to get the list of files changed in each commit. Treating each commit as an observation of which files changed, we calculate the binary Pearson correlation between every pair of files as a measure of their similarity (a number between -1 and 1, though we throw away anything <= 0). We then use Markov clustering (MCL) to cluster the results into distinct groupings.
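To make the similarity step concrete, here is a minimal Python sketch of a Pearson correlation on 0/1 "changed in this commit" indicators (also known as the phi coefficient). It is not the binary-pearsons tool itself: the function name `binary_pearson` and the input shape (one set of file names per commit) are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def binary_pearson(commits):
    """Return {(file_a, file_b): score} for every pair with a positive score.

    `commits` is an iterable of sets of file names, one set per commit.
    (Hypothetical input shape, chosen for this sketch.)
    """
    commits = [set(files) for files in commits]
    n = len(commits)
    changed = Counter()   # commits that touch each file
    together = Counter()  # commits that touch both files of a pair
    for files in commits:
        changed.update(files)
        together.update(combinations(sorted(files), 2))

    scores = {}
    # Pairs that never change together always score below zero,
    # so only pairs seen together need to be examined.
    for (a, b), n_ab in together.items():
        n_a, n_b = changed[a], changed[b]
        denom = sqrt(n_a * (n - n_a) * n_b * (n - n_b))
        if denom == 0:
            continue  # a file changed in every commit has no variance
        score = (n * n_ab - n_a * n_b) / denom
        if score > 0:  # throw away anything <= 0
            scores[(a, b)] = score
    return scores
```

Each surviving pair can then be written out as a "file, file, score" line, which is the kind of label-based input that `mcl --abc` reads.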

The results are obviously far from perfect. But equally obviously there’s a lot of interesting information in them, and the technique could certainly be refined, e.g. by weighting each file by the size of its diff rather than using a simple 0/1 changed flag, or by experimenting with other clustering algorithms. Maybe something worth pursuing?


Comments

Jonathan Ellis on 2009-04-30 15:53:43:

What series of pipes do I need to chain together to run this against my svn repo? Is your Markov implementation even public?

david on 2009-04-30 16:03:46:

Something like the following should work:

svn log -v | ruby svn_log_munger.rb > svn_commits
path/to/binary-pearsons/bin/pearsons < svn_commits > file_correlations
mcl file_correlations --abc -o file_clusters

This writes one cluster per line, with the file names in each cluster separated by tabs.

sed 's/$/\n-------\n\n/; s/\t/\n/g; s/_/ /g' file_clusters | less

will give you a more readable view onto it.

david on 2009-04-30 16:04:37:

Oh, you can get a source install of MCL from the linked page, and you’ll need Ruby, Rake, Java and gcc to build binary-pearsons (you can get the source from the linked github page).

Jonathan Ellis on 2009-04-30 16:16:14:

rake aborted!
no such file to load -- spec/rake/spectask
/home/jonathan/projects/svnfun/binary-pearsons/Rakefile:1:in `require'

?

david on 2009-04-30 16:19:58:

Ah. You’ll need to install rspec (it uses it for testing). Do “sudo gem install rspec” (you may need to install rubygems first).

Alternatively you could just build the files manually if you get fed up with Ruby. :-) The Java class files should go in a classes directory inside the binary-pearsons directory, and the C files should compile to a “tally” program which goes in binary-pearsons/bin.

david on 2009-04-30 16:20:45:

Apologies for this being a slight ordeal. I haven’t quite made binary-pearsons easily packageable yet, and I didn’t really design the SVN clustering for reuse - it’s just an amusing quick hack.

david on 2009-04-30 16:27:52:

By the way, you’ll need to edit the Ruby script to make it look for the right extension. It currently only looks for Scala files (just find where it says scala and change the extension appropriately).
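For anyone who would rather not touch the Ruby, the scraping step can be sketched in Python with the extension as a parameter. This is not svn_log_munger.rb and makes no attempt to match its output format; the function name `commits_from_svn_log` and the list-of-sets result are assumptions for illustration. It relies only on the standard layout of `svn log -v` output: a row of 72 dashes between revisions, and a "Changed paths:" list that ends at the first blank line.

```python
import re

SEPARATOR = "-" * 72  # the line svn prints between revisions
# A changed-path line looks like "   M /trunk/Foo.scala", optionally
# followed by " (from /old/path:123)" when the file was copied.
PATH_LINE = re.compile(r"^   [A-Z] (/.+?)(?: \(from .+:\d+\))?$")


def commits_from_svn_log(lines, extension=".scala"):
    """Return one set of changed paths per commit, keeping only paths
    that end with `extension`. Commits with no such paths are dropped."""
    commits = []
    current = set()
    in_paths = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line == SEPARATOR:
            if current:
                commits.append(current)
            current, in_paths = set(), False
        elif line == "Changed paths:":
            in_paths = True
        elif in_paths:
            match = PATH_LINE.match(line)
            if match is None:
                in_paths = False  # blank line: the commit message follows
            elif match.group(1).endswith(extension):
                current.add(match.group(1))
    if current:
        commits.append(current)
    return commits
```

Feeding it the lines of `svn log -v` output with a different `extension` (say `".java"`) is the equivalent of editing the extension in the script. Note that dropping commits which touch no matching files changes the total commit count that any later correlation step sees, which is a choice made by this sketch and not necessarily what the original script does.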

Jonathan Ellis on 2009-04-30 16:56:11:

ok, I got

Successfully installed rspec-1.2.5
1 gem installed

but still getting the rake error.

sorry, you can’t post about how cool your new toy is and expect people to not want to play with it. :)

david on 2009-04-30 17:02:43:

Bizarre. I’m not sure why that would be happening. The easiest thing to do is probably for you to edit the Rakefile and delete the require and the SpecTask sections.

And I’m fine with you wanting to play with it. :-) I’m just apologising for the fact that it’s not very well built to be reproducible.

david on 2009-04-30 17:04:54:

http://gist.github.com/104517 should be a version that works

Jonathan Ellis on 2009-04-30 18:22:20:

okay, I’m almost there. I built mcl-09-116 and there are a couple dozen executables lying around but none of them is `mcl`. (there is an Ubuntu package called mcl but that seems to be something else.)

david on 2009-04-30 18:31:56:

Hm. You seem to be correct. How odd! make install will put an mcl binary on your path, but for some reason the binary is not built by the default make target.

You can either install it globally or, if you run ./configure --prefix=somepath before building, make install will put all the installed files under that directory instead. The binary is then at somepath/bin/mcl.

Jonathan Ellis on 2009-04-30 23:39:08:

Cool, I got it working: http://spyced.blogspot.com/2009/04/automatic-project-structure-inference.html

Thanks for the help!

david on 2009-04-30 23:57:08:

No problem. Glad you got it to work!

AA on 2009-05-04 22:36:41:

eROSE (an Eclipse plugin) uses co-change patterns to suggest more files when you edit a file ... “other programmers who changed this function also changed this function” ... http://www.st.cs.uni-saarland.de/softevo/erose

david on 2009-05-04 23:14:16:

That’s really cute. Thanks for the link!