Build Your Own GitHub Code Leak Monitoring Tool
2025-08-18 11:37:20
GitHub Code Leak Monitoring Tool: Development Guide
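Background and Rationale
As an open-source code hosting platform, GitHub is a frequent source of sensitive information leaks, including:
- Usernames, passwords, and database connection details
- Internal network IP addresses
- Developers' personal information (height, weight, age, and so on)
This tool automatically monitors code published on GitHub so that commits containing sensitive information are discovered promptly.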
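Development Environment and Dependencies
System environment:
- macOS 10.12.6 (any other system that runs Python also works)
- Python 3.6.5+
Required Python libraries:
- requests: HTTP requests
- lxml: HTML/XML parsing
- csv: CSV file handling
- tqdm: progress bar display
- email/smtplib: sending email
- configparser: configuration file parsing
- time: timing control
Of these, requests, lxml, and tqdm are third-party packages; the rest ship with Python.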
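Core Functionality
1. GitHub Login Mechanism
GitHub login flow:
- Request https://github.com/login to fetch the login page
- Extract the authenticity_token value from the page
- Submit a POST request to https://github.com/session
Key code: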
import requests
from lxml import etree


def login_github(username, password):
    login_url = 'https://github.com/login'
    session_url = 'https://github.com/session'
    try:
        s = requests.session()
        resp = s.get(login_url).text
        dom_tree = etree.HTML(resp)
        # xpath() returns a list; take the first matching token value
        key = dom_tree.xpath('//input[@name="authenticity_token"]/@value')[0]
        user_data = {
            'commit': 'Sign in',
            'utf8': '✓',
            'authenticity_token': key,
            'login': username,
            'password': password
        }
        s.post(session_url, data=user_data)
        s.get('https://github.com/settings/profile')  # verify the login
        return s
    except Exception as e:
        print('An error occurred; check the network settings, username, and password:', e)
        return None
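The login code above requests the profile settings page to verify the login, but it does not inspect the response. A minimal sketch of such a check follows. It assumes, without confirmation from the original project, that an anonymous session asking for a settings page ends up redirected to a /login or /session URL; `looks_logged_in` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def looks_logged_in(final_url, status_code):
    """Guess whether a settings-page request was served to a logged-in user.

    Assumption: an anonymous session is redirected to /login (or /session),
    so the final URL after redirects shows whether the login worked.
    """
    if status_code != 200:
        return False
    path = urlparse(final_url).path
    return not (path.startswith('/login') or path.startswith('/session'))
```

With a requests session `s`, it could be used as `resp = s.get('https://github.com/settings/profile')` followed by `looks_logged_in(resp.url, resp.status_code)`, since `resp.url` holds the final URL after redirects.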
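2. Code Search and Parsing
Search flow:
- Build the search URL: https://github.com/search?p=[page number]&q=[keyword]&type=Code
- Parse the returned HTML page with XPath
- Extract the key information:
  - Repository URL
  - Username
  - Upload time
  - Filename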
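The URL pattern in the first step can be sketched as a small helper. This is only an illustration: `build_search_url` is a hypothetical name, and percent-encoding the keyword with `urlencode` is an addition that keeps keywords containing spaces, `@`, or `&` from breaking the query string.

```python
from urllib.parse import urlencode


def build_search_url(keyword, page):
    """Build the code-search URL for one result page, encoding the keyword."""
    query = urlencode({'p': page, 'q': keyword, 'type': 'Code'})
    return f'https://github.com/search?{query}'
```

For example, `build_search_url('example.com password', 2)` yields a URL whose query part is `p=2&q=example.com+password&type=Code`.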
Key code:
import csv

import requests
from lxml import etree
from tqdm import tqdm


def hunter(gUser, gPass, keyword, payloads):
    sensitive_list = []
    tUrls = []
    try:
        with open('leak.csv', 'w', encoding='utf-8', newline='') as csv_file:
            writer = csv.writer(csv_file)
            writer.writerow(['URL', 'Username', 'Upload Time', 'Filename'])
            s = login_github(gUser, gPass)
            for page in tqdm(range(1, 6)):  # search pages 1-5
                search_code = f'https://github.com/search?p={page}&q={keyword}&type=Code'
                resp = s.get(search_code)
                dom_tree_code = etree.HTML(resp.text)
                # Extract the result fields with XPath
                Urls = dom_tree_code.xpath('//div[@class="d-inline-block col-10"]/a[2]/@href')
                users = dom_tree_code.xpath('//a[@class="text-bold"]/text()')
                upload_times = dom_tree_code.xpath('//relative-time/text()')
                filenames = dom_tree_code.xpath('//div[@class="d-inline-block col-10"]/a[2]/text()')
                # zip() stops at the shortest list, so uneven results cannot raise IndexError
                for url, user, uploaded, filename in zip(Urls, users, upload_times, filenames):
                    full_url = 'https://github.com' + url
                    tUrls.append(full_url)
                    writer.writerow([full_url, user, uploaded, filename])
                    # Check the raw file content for sensitive strings
                    raw_url = 'https://raw.githubusercontent.com' + url.replace('/blob', '')
                    code = requests.get(raw_url).text
                    for payload in payloads:
                        if payload in code:
                            leak_info = f"Matched payload: {payload}\n{full_url}\n\nCode:\n{code}\n\n"
                            sensitive_list.append(leak_info)
    except Exception as e:
        print(e)
    # Return whatever was collected, even if a request failed part-way through
    return sensitive_list
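The payload check above appends the entire file to the report for every payload that matches, which can make the alert mail very long. One possible variation, sketched here with hypothetical helper names, reports only the matching lines and compares case-insensitively (the original comparison is case-sensitive). `to_raw_url` also replaces only the first `/blob/` segment, so that a directory or file that happens to be named `blob` is left alone.

```python
def to_raw_url(blob_path):
    """Map a '/user/repo/blob/branch/file' path to its raw-content URL."""
    # Replace only the first '/blob/' so later path segments named 'blob' survive.
    return 'https://raw.githubusercontent.com' + blob_path.replace('/blob/', '/', 1)


def find_hits(code, payloads):
    """Return (payload, line_number, line) for every line containing a payload."""
    hits = []
    for line_no, line in enumerate(code.splitlines(), start=1):
        lowered = line.lower()
        for payload in payloads:
            if payload.lower() in lowered:
                hits.append((payload, line_no, line.strip()))
    return hits
```

Each hit could then be formatted into a short report entry such as the payload, the file URL, and the offending line, instead of the full source text.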
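3. Email Alert System
Email features:
- Supports multiple recipients
- Can send HTML-formatted mail (the sample code sends a plain-text body)
- Attaches the search results as a CSV file
Key code: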
import smtplib
from email import encoders
from email.header import Header
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.utils import formataddr, parseaddr


def send_warning(host, username, password, sender, receivers, content):
    def _format_addr(s):
        name, addr = parseaddr(s)
        return formataddr((Header(name, 'utf-8').encode(), addr))

    msg = MIMEMultipart()
    msg['From'] = _format_addr(f'GitHub Security Monitor <{sender}>')
    msg['To'] = ', '.join(receivers)
    msg['Subject'] = Header('GitHub sensitive information leak notice', 'utf-8').encode()
    # Mail body
    msg.attach(MIMEText(f'''
Dear all
Please note: sensitive information may have been uploaded to GitHub. The repositories below may contain sensitive information.
{content}
''', 'plain', 'utf-8'))
    # Attach the CSV search results
    with open('leak.csv', 'rb') as f:
        m = MIMEBase('text', 'csv', filename='leak.csv')
        m.add_header('Content-Disposition', 'attachment', filename='leak.csv')
        m.add_header('Content-ID', '<0>')
        m.add_header('X-Attachment-ID', '0')
        m.set_payload(f.read())
        encoders.encode_base64(m)
        msg.attach(m)
    try:
        # The with-block closes the connection even if login or sending fails
        with smtplib.SMTP(host, 25) as server:
            server.login(username, password)
            server.sendmail(sender, receivers, msg.as_string())
        print('Mail sent successfully!')
    except Exception as err:
        print(err)
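As an alternative to assembling the MIME parts by hand, the standard library's `email.message.EmailMessage` can build the same kind of message in fewer steps. The sketch below only constructs the message and does not send it; `build_warning` and its parameters are hypothetical names, and the subject line is just a placeholder.

```python
from email.message import EmailMessage


def build_warning(sender, receivers, content, csv_bytes):
    """Build (but do not send) an alert mail with the CSV results attached."""
    msg = EmailMessage()
    msg['From'] = sender
    msg['To'] = ', '.join(receivers)
    msg['Subject'] = 'GitHub sensitive information leak notice'
    # Plain-text body; the library picks a suitable encoding for non-ASCII text.
    msg.set_content(content)
    # Adding an attachment turns the message into multipart/mixed.
    msg.add_attachment(csv_bytes, maintype='text', subtype='csv',
                       filename='leak.csv')
    return msg
```

A message built this way can be handed to `smtplib.SMTP.send_message()`, which takes the sender and recipients from the headers.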
4. Configuration File Management
Configuration file format (info.ini):
[KEYWORD]
keyword = your main keyword here
[EMAIL]
host = Email server
user = Email User
password = Email password
[SENDER]
sender = The email sender
[RECEIVER]
receiver1 = Email receiver No.1
receiver2 = Email receiver No.2
[Github]
user = Github Username
password = Github Password
[PAYLOADS]
p1 = Payload 1
p2 = Payload 2
p3 = Payload 3
p4 = Payload 4
p5 = Payload 5
p6 = Payload 6
Code for reading the configuration file:
import configparser

config = configparser.ConfigParser()
config.read('info.ini')
# Read the GitHub credentials
g_User = config['Github']['user']
g_Pass = config['Github']['password']
# Read the mail settings
host = config['EMAIL']['host']
m_User = config['EMAIL']['user']
m_Pass = config['EMAIL']['password']
m_sender = config['SENDER']['sender']
# Read the recipient list
receivers = [config['RECEIVER'][k] for k in config['RECEIVER']]
# Read the search keyword and the payloads
keyword = config['KEYWORD']['keyword']
payloads = [config['PAYLOADS'][key] for key in config['PAYLOADS']]
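One detail worth knowing when storing passwords in an INI file: by default `configparser` treats `%` as interpolation syntax, so a password containing `%` raises an error when it is read. The sketch below parses a shortened sample of the format above with `interpolation=None`; the sample values and the `load_settings` helper are made up for illustration.

```python
import configparser

# Shortened, made-up sample in the same layout as info.ini
SAMPLE_INI = """
[KEYWORD]
keyword = example.com
[RECEIVER]
receiver1 = alice@example.com
receiver2 = bob@example.com
[Github]
user = monitor-bot
password = p%ss
[PAYLOADS]
p1 = password
p2 = secret_key
"""


def load_settings(text):
    """Parse INI text into a plain dict of the values the tool needs."""
    # interpolation=None keeps '%' characters in values literal.
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(text)
    return {
        'keyword': config['KEYWORD']['keyword'],
        'receivers': list(config['RECEIVER'].values()),
        'github_user': config['Github']['user'],
        'github_password': config['Github']['password'],
        'payloads': list(config['PAYLOADS'].values()),
    }
```

Section names are case-sensitive, so `[Github]` in the file must be read as `config['Github']`, while option names such as `receiver1` are lowercased by the parser.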
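Usage Recommendations
- Search strategy:
  - Main keyword: company domain, email suffix, employee names, and so on
  - Auxiliary payloads: sensitive keywords such as password, username, and database
- Run frequency:
  - Run the tool about twice a day
  - Avoid frequent requests, which can trigger GitHub's anti-scraping measures
- Deployment:
  - Use a Linux crontab scheduled job
  - Example crontab entry (runs twice a day, at 09:00 and 21:00):
0 9,21 * * * /usr/bin/python3 /path/to/github_monitor.py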
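The advice on avoiding frequent requests also applies within a single run, where several result pages are fetched back to back. A minimal sketch of pacing those requests is shown below; `paced` is a hypothetical helper, and a suitable delay is a guess that depends on how GitHub throttles traffic.

```python
import time


def paced(items, delay_seconds, sleep=time.sleep):
    """Yield items one by one, pausing between consecutive items."""
    for index, item in enumerate(items):
        if index:  # no pause before the first item
            sleep(delay_seconds)
        yield item
```

In the search loop this could replace the plain range, for example `for page in paced(range(1, 6), 10):`, which waits ten seconds between pages. The `sleep` parameter exists only so the pause can be swapped out in tests.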
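Getting the Complete Code
The project is open source on GitHub:
https://github.com/Hell0W0rld0/Github-Hunter
Notes
- GitHub may change its page structure, so the XPath expressions need periodic maintenance
- The tool is intended only for an organization's own security self-assessment; do not use it for illegal purposes
- Use a dedicated monitoring account rather than a personal GitHub account
- Search results may contain false positives and need manual confirmation
With this tool, an organization can promptly discover and handle sensitive information leaks on GitHub and reduce its security risk.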