A while ago, Rob asked me to hack something together to grab some statistics from git. Being the kind of guy who likes reusing tools to make more complicated beasts, I used awk to do this. He recently asked me to document it better for future reference, so that it's not just an email floating around with obscure awk incantations in it.
git whatchanged --stat --format="%H %at %an" |
grep -v '|' |
awk '
/^[a-f0-9]+ / {
    commit = $1
    time = $2
    sub(/^[a-f0-9]+ [0-9]+ /, "")
    name = $0
}
/changed/ {
    files = $1
    inserts = $4
    deletes = $6
    print commit "+" time " " files " " inserts " " deletes " " name
}
'
We first ask git what has changed in the repository (whatchanged), requesting a diffstat (--stat) instead of the actual diff of each commit. The commit format is the hash (%H) followed by the author timestamp (%at) and the author's name (%an). The author can differ from the committer, so we ask for whoever actually wrote the change rather than whoever had commit access. The grep then filters out the per-file diffstat lines (the ones containing '|'), since git also sums them up nicely at the bottom.
When awk gets the data, it looks like:
122673a68a95a6f3a27c46624b5dd6d98fcbbab7 1281053627 Ben Boeckel
 1 files changed, 13 insertions(+), 0 deletions(-)
The first awk block:
/^[a-f0-9]+ / {
    commit = $1
    time = $2
    sub(/^[a-f0-9]+ [0-9]+ /, "")
    name = $0
}
Applies to the commit line (which matches the regex for hex characters at the beginning of a line followed by a space). It takes the commit hash and the timestamp and stores them in variables. It then strips those parts off the line and uses the rest as the name, which keeps names containing spaces intact.
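To see what the first block does in isolation, here is a small sketch that feeds it just the commit line from the sample above. The print statements and the shell variables line and parsed are additions for illustration only; they are not part of the real script.

```shell
# Sample commit line in the "%H %at %an" format, copied from the example above.
line='122673a68a95a6f3a27c46624b5dd6d98fcbbab7 1281053627 Ben Boeckel'

parsed=$(printf '%s\n' "$line" | awk '
/^[a-f0-9]+ / {
    commit = $1
    time = $2
    # sub() with no third argument edits $0 in place,
    # leaving only what follows the hash and timestamp.
    sub(/^[a-f0-9]+ [0-9]+ /, "")
    name = $0
    # Illustration only: show what ended up in each variable.
    print "commit=" commit
    print "time=" time
    print "name=" name
}')
printf '%s\n' "$parsed"
```

The hash and timestamp are saved before sub() runs, because sub() rewrites $0 and the fields are re-split afterwards.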
The second awk block:
/changed/ {
    files = $1
    inserts = $4
    deletes = $6
    print commit "+" time " " files " " inserts " " deletes " " name
}
Parses the line that matches 'changed' and reads the number of files changed, lines inserted and lines deleted from fixed field positions. It then prints a line composed of the information parsed from git. One weird thing about awk is that string concatenation is written by simply placing expressions next to each other, conventionally separated by a space. It's a little weird since the spaces I want in the output must be put in quotes and then surrounded by more spaces, making the line hard to read without proper highlighting (which pygments seems to lack for awk). The resulting line can be put into a database and queried later.
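If the run of quoted spaces bothers you, one possible alternative is awk's printf, which puts the whole layout in a single format string. The sketch below runs that variant on the two sample lines from above. The shell variables sample and record are made up for the demonstration, and the printf form is a suggestion rather than what the original script uses.

```shell
# Canned input: the commit line and the summary line from the example above.
sample='122673a68a95a6f3a27c46624b5dd6d98fcbbab7 1281053627 Ben Boeckel
 1 files changed, 13 insertions(+), 0 deletions(-)'

record=$(printf '%s\n' "$sample" | awk '
/^[a-f0-9]+ / {
    commit = $1
    time = $2
    sub(/^[a-f0-9]+ [0-9]+ /, "")
    name = $0
}
/changed/ {
    # Same fields as the print statement, with the layout in one format string.
    printf "%s+%s %s %s %s %s\n", commit, time, $1, $4, $6, name
}')
printf '%s\n' "$record"
# prints: 122673a68a95a6f3a27c46624b5dd6d98fcbbab7+1281053627 1 13 0 Ben Boeckel
```

The output is the same as with the concatenating print; it is purely a matter of which form you find easier to read.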
The script has a few shortcomings. If anyone whose name contains 'changed' authors a commit, the second block will also match the commit line and screw things up. This is also the reason the per-file diffstat lines are grepped out: a file name containing 'changed' would trigger the same false match. This can be fixed with a better match pattern on the second block (/^ / maybe), but it decreases the readability for those who don't know that there just happens to be a space at the start of that line.
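A rough sketch of the problem and one possible fix follows. The input is invented (the hash, the counts and the author "Ann changed" are hypothetical), the stats helper exists only for this demonstration, and the stricter pattern /^ [0-9]+ files? changed/ is my own guess at a reasonable match, not something tested against every git version.

```shell
# Hypothetical input: the author's name contains the word "changed".
sample='0123456789abcdef0123456789abcdef01234567 1281053627 Ann changed
 2 files changed, 5 insertions(+), 3 deletions(-)'

# stats PATTERN: run the script with PATTERN selecting the summary line.
stats() {
    printf '%s\n' "$sample" | awk -v pattern="$1" '
    /^[a-f0-9]+ / {
        commit = $1
        time = $2
        sub(/^[a-f0-9]+ [0-9]+ /, "")
        name = $0
    }
    # Dynamic regex so the loose and strict patterns can be compared.
    $0 ~ pattern {
        print commit "+" time " " $1 " " $4 " " $6 " " name
    }'
}

stats 'changed'                      # loose: also fires on the commit line
stats '^ [0-9]+ files? changed'      # strict: only the summary line
```

With the loose pattern the script emits two records, the first of them garbage built from the author's name; with the strict one it emits a single correct record. The files? part also accepts the singular "1 file changed" that newer git releases print. As far as I know, those releases also leave out a count that is zero, so the fixed positions $4 and $6 would need rethinking there.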