Mercurial Ate Our Breakfast, But We Don't Mind
Before we begin, here’s a spoiler: Tim and I independently invented revsets, but Matt Mackall put them in Mercurial 1.6 before we had told anyone about it, so we can’t rightfully proclaim our genius to the internet. Fortunately, that’s not the whole story.
Two Guys and a Database Course
In the fall of 2009, Tim and I were taking a course in database systems. As the only undergraduate students in the class, we didn’t have a good concept of what goes into a good project for a course consisting primarily of discussions about research papers, especially since our usual modus operandi for most course projects is to point a fire hose of code at an implementation idea. For a couple of weeks, we were lost. Then one day Tim said to me, “Steve, what if you could query your version control repository like a database?” Suddenly we had a thought experiment and a series of course projects that will last at least into the next year.
Think about the wealth of data stored in a repository. By looking at that data, you can see who wrote what and when, who the clients of a committer’s code are, who collaborates with whom, who maintains code written in the past by someone else, and even who has the most impact on the project as a whole. Some hosts such as Github offer basic statistics, and some individuals have written scripts to gather more interesting data about committer participation, but these are not conceptually scalable solutions. A generalized layer on top of SCMs could make data mining easier and more useful. Our name for this layer was SourceQL. We thought of SourceQL as a query language and index that would make it simple to find out things like the average longevity of a committer’s code, i.e. how long, on average, the committer’s code lasts before being overwritten by someone else’s changes.
We spent that semester brainstorming ways to query repository data, focusing primarily on finding ways to specify subsets of the full graph to search (essentially revsets) and efficient ways to index the graph. We also developed the basics of an Xpath-like language for working with sets of revisions. Mercurial revsets didn’t exist at that time, but our ideas weren’t solid enough to describe formally, so we let them stay in Professorland. The final report for that course is on Scribd now, along with 6 other papers about it.
During our winter break, we started to think about ways of indexing text file contents to allow for fast searching, since tools like ‘hg grep’ are slow and only provide addition and deletion points of a string, rather than all revisions where that string appears. We read Russ Cox’s article Regular Expression Matching: The Virtual Machine Approach and determined that a VM-based regular expression engine would map well to tries, which would allow us to index and search repository content efficiently. By the time we got back to CWRU in January, I had implemented the compiler and Tim had implemented the virtual machine. We called our design ReTrie.
Four Guys and a Project Course
In the spring of 2010, Tim and I were both “technically” seniors, so we signed up for the senior project course as well as an advanced data structures course which also required a project. This time, we wanted to point the code fire hose at SourceQL to see how far we could get. Evan Krall and Gary Doran joined us to help with the implementation.
At the end of the semester, we had implemented a framework for block-based disk data structures, B+trees and B-trees, a disk-based trie, regular expression search on that trie, a parser library for Go, a query language for specifying subsets of a directed acyclic graphs, and Mercurial hooks put repository data into our indexes. It was all tied together in a handy command line tool. We patted ourselves on the back for building a system that could give you information about your repository that you couldn’t easily get anywhere else. A full description of what we accomplished can be found in the final report for the project course.
Then we went off to our summer internships, got busy, and didn’t publish anything. In terms of Internet Fame Management, this was a big mistake, because Mercurial 1.6 was released on July 1st, two months after we had finished the first version of SourceQL. It introduced revsets, a near-complete implementation of everything we had developed in terms of subgraph selection. A delicious omelet had been snatched from our plates because we had been carving little faces into our toast with forks.
But actually, that’s for the best.
It’s All About the Data
Sure, we missed a chance at some minor implementation karma. But revsets were on their way anyhow, and SourceQL still isn’t ready for common usage. There is a lot left to do before it becomes a useful tool. Not only that, but the functionality of revsets really belongs in the hands of Mercurial maintainers who can optimize it for their storage system.
Our end goal with SourceQL was to use it to run queries to gather telling information about a code base. One such query would find the blast radius of a change, i.e. how much code will be affected by a change to a file. One simple way to measure blast radius is to track which files are typically edited in the same commit, or perhaps expanding granularity down to locations of lines edited. Another data point is the committer longevity metric mentioned earlier. With push/pull information, graphs of collaboration patterns can be built to demonstrate the de facto power structure of a project. Files can be checked for commit frequency to determine which parts of the code should be targeted for more extensive testing.
Revsets don't solve the idea of a query on repository data, they just help the user narrow the scope of the query. The real magic happens when it's easy to join different pieces of data together to create a whole picture. Revsets don’t provide enough expressiveness and power to truly unlock the data contained in the repository. That is the dream of SourceQL: to allow developers to learn about their code without first having to write a complex script.
(If you're still wondering what the heck revsets are, subscribe to the RSS feed or check back later.)
What Now?
For starters, we want to publish more of what we’ve learned by thinking about version control systems. We don’t have any startups planned in this area, it’s just something we want to see happen. Whether or not we finish SourceQL and see people use it, we think it is important that people realize the potential for repository data beyond save/undo. This blog will in part be a home for posts about the power of version control.
We want to keep working on SourceQL, including integrating it with revsets and improving our string index. No existing SCMs attempt to index file contents for retrieval, and with good reason: such indexes are generally a waste of space for the way people use their repositories. But we think it can be useful to designate one repository as the “data mining repo” and let it maintain an efficient index. However, we are not satisfied with the design and performance of the B-tries we used in the first version, so I have been working on a new disk-based trie structure. A prototype of this structure is available on Github.
We are still students and are constantly trying to learn from our mistakes. If you have any comments about the quality of our writing or the work we have done, please leave comments.
References
SourceQL Paper 6, the final report for our senior project course, is the most comprehensive overview of SourceQL. You can see all the papers in the Scribd collection. (Create an account to see the spiffy HTML5 version.)
If you want to know more about B-tries, a disk-based trie-like string index structure, read Appendix A of Paper 6 or google them and use your absurdly expensive account on one of the scientific paper exploitation sites to read the original paper about them.
To see the source code and current progress for Steve's disk-based trie explorations, look at the JTrie project on Github.
Russ Cox's second regex article explains the basis for our regular expression engine. If you want a more audiovisual version of that article, check out Tim Henderson's Hacker Society talk about it over at his blog, Hackthology. (It's got pictures!)
Ross Cox himself has been working on RE2, a C++ implementation of the ideas presented in his articles. The hot-off-the-grill Python bindings will probably speed up your own regex code.
Finally, folks on Hacker News have been talking about the best ideas they never followed through on. Funnily enough, it was published the day before this post went live.



1 Comment
http://www.sciencedirect.com/science/article/pii/095058499090126C
Leave a Comment