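GitHub can report how many lines of code each contributor has added to a repository. At our company's annual party there was a special "Code God Award" for the engineer who contributed the most code in the previous year. According to GitHub's statistics, the winner committed 1.1 million lines last year. That number is staggering; no single person can write that much code. Out of curiosity I dug into it and found that the total included many third-party libraries he had committed, which GitHub counted as well, and that code he merged was also attributed to him. So is there a way to strip out this invalid data and get the real contribution? After looking through the GitHub API and combining it with git commands, it turns out there is. Here is the code: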
# Copy this script into your target repo,
# then run `python github-stats.py <token>` to collect data.
import re
import subprocess
import sys

import requests

# GitHub token from the command line
tk = sys.argv[1]
user_stats = {}

# Query the GitHub API for last year's commits
payload = {'since': '2013-01-01T00:00:00Z', 'until': '2014-01-01T00:00:00Z'}
# GitHub no longer accepts access_token as a query parameter,
# so the token is sent in the Authorization header instead.
headers = {'Authorization': 'token ' + tk}


def git(*args):
    """Run a git command in the current repo and return its stdout."""
    return subprocess.run(('git',) + args, capture_output=True, text=True).stdout


def is_merge(commit_sha):
    # -s suppresses the diff, so only the one-line title is searched
    title = git('show', '-s', '--oneline', commit_sha)
    return re.search('Merge', title) is not None


def collect_stats(commit_list):
    for m in commit_list:
        sha = m['sha']
        if is_merge(sha):
            continue
        user = git('show', '-s', '--format=%an', sha).strip()
        # Diff parent -> commit so insertions and deletions are not swapped
        data = git('diff', '--shortstat', sha + '^', sha)
        r_ins = re.search(r'(\d+) insertion', data)
        r_del = re.search(r'(\d+) deletion', data)
        ins_data = int(r_ins.group(1)) if r_ins else 0
        del_data = int(r_del.group(1)) if r_del else 0
        # Ignore huge commits, which are most likely imported third-party code
        if ins_data + del_data > 5000:
            print(user)
            print('ins:' + str(ins_data))
            print('del:' + str(del_data))
            ins_data = 0
            del_data = 0
        stats = user_stats.setdefault(user, {'additions': 0, 'deletions': 0, 'total': 0})
        stats['additions'] += ins_data
        stats['deletions'] += del_data
        stats['total'] += ins_data + del_data


r = requests.get("https://api.github.com/repos/cocos2d/cocos2d-x/commits",
                 params=payload, headers=headers)
collect_stats(r.json())
print(user_stats)
print(r.headers['X-RateLimit-Remaining'])

# Follow the rel="next" links in the Link header until the last page
pattern = re.compile(r'<(\S+)>; rel="next"')
result = pattern.search(r.headers.get('link', ''))
while result is not None:
    r = requests.get(result.group(1), headers=headers)
    collect_stats(r.json())
    result = pattern.search(r.headers.get('link', ''))

print(r.headers['X-RateLimit-Remaining'])
print(user_stats)

The code is also available on GitHub: https://github.com/heliclei/githubtools/blob/master/github-stats.py
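The per-commit rule the script applies can be pulled out into a small pure function, which makes it easy to try on sample `git diff --shortstat` output without touching a repository. This is only a sketch: the names `parse_shortstat`, `countable_lines` and `MAX_LINES_PER_COMMIT` are hypothetical and do not appear in the script above; only the 5000-line threshold and the two regular expressions are taken from it.

```python
import re

# Hypothetical names for illustration; the threshold mirrors the script above.
MAX_LINES_PER_COMMIT = 5000


def parse_shortstat(text):
    """Return (insertions, deletions) parsed from `git diff --shortstat` output."""
    ins = re.search(r"(\d+) insertion", text)
    dels = re.search(r"(\d+) deletion", text)
    return (int(ins.group(1)) if ins else 0,
            int(dels.group(1)) if dels else 0)


def countable_lines(text, limit=MAX_LINES_PER_COMMIT):
    """Apply the size filter: a commit above the limit contributes nothing."""
    ins, dels = parse_shortstat(text)
    if ins + dels > limit:
        return 0, 0
    return ins, dels


sample = " 3 files changed, 120 insertions(+), 7 deletions(-)"
print(countable_lines(sample))
```

Git omits the insertion or deletion part of the summary when a commit has none, which is why each count falls back to 0 when its pattern does not match.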
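The script skips any single commit that changes more than 5000 lines, and it also skips merge commits. First clone the repository you want to analyze, then copy the script into the local git repository. Note that you need to change this line to the URL of your own repository: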
https://api.github.com/repos/cocos2d/cocos2d-x/commits
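Instead of editing the URL by hand, it could be derived from the clone's remote address. The helper below is a hypothetical sketch, not part of the original script; it assumes the remote points at github.com and follows the `https://api.github.com/repos/<owner>/<repo>/commits` pattern shown above.

```python
import re


def commits_api_url(remote_url):
    """Map a GitHub remote URL (SSH or HTTPS) to the commits API endpoint."""
    m = re.search(r"github\.com[:/]([^/]+)/(.+?)(?:\.git)?/?$", remote_url)
    if m is None:
        raise ValueError("not a GitHub remote: " + remote_url)
    owner, repo = m.groups()
    return "https://api.github.com/repos/{}/{}/commits".format(owner, repo)


print(commits_api_url("git@github.com:cocos2d/cocos2d-x.git"))
```

The remote address itself can be read inside the clone with `git config --get remote.origin.url`.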
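You can generate a GitHub token with the script from the previous post.
Run: python github-stats.py xxxxxxxxxxxxxgithub-oauth-tokenxxxxxxxxxxxxxxxxxxx
PS: After filtering, the cocos2d-x "code god" still contributed more than 100,000 lines last year, which is very impressive, but the number is no longer as surreal as 1.1 million.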