Accurately Extracting URLs from Logs: A Technical Walkthrough

Background and Requirements

During log analysis, it is often necessary to extract nested URLs from request parameters, for example:

http://www.xxx.cn/r/common/register_tpl_shortcut.php?ico_url=http://www.abcfdsf.com/tg_play_1121.php&supplier_id=3&ep=tg&style=szsg_reg_tg03

http://b.xxx.cn?c=<IMG src="http://www.thesiteyouareon.com/somecommand.php?somevariables=maliciouscode">

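Once these nested URLs have been extracted, they can be:

Checked against a threat-intelligence database, with any blacklist hit flagged as malicious immediately
Tagged for follow-up analysis when they do not appear on the whitelist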
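
The triage described above can be sketched as a simple set lookup. This is only an illustration: it assumes the blacklist and whitelist are in-memory sets of host names, and the names classify_host, BLACKLIST and WHITELIST, as well as the sample entries, are hypothetical rather than part of the original tool.

```python
# Hypothetical sample data; a real deployment would load these sets
# from a threat-intelligence source.
BLACKLIST = {"www.thesiteyouareon.com"}
WHITELIST = {"www.xxx.cn", "b.xxx.cn"}


def classify_host(host):
    """Return 'black', 'white', or 'unknown' for an extracted host."""
    host = host.lower()
    if host in BLACKLIST:
        return "black"    # blacklist hit: flag immediately
    if host in WHITELIST:
        return "white"
    return "unknown"      # not whitelisted: tag for later analysis


print(classify_host("www.thesiteyouareon.com"))
print(classify_host("www.abcfdsf.com"))
```

Hosts are lower-cased before the lookup because host names are case-insensitive.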

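Comparison of URL Extraction Methods

Traditional regular-expression approach

Pros: simple to implement
Cons: low accuracy; struggles with complex cases

Lexical-analysis approach (recommended in this article)

Parses the input according to the structural features of URLs
High accuracy; handles a wide range of complex cases
Adapted from: https://blog.csdn.net/breaksoftware/article/details/7009209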
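
To illustrate the accuracy problem, here is one deliberately naive pattern. It is an assumption chosen for demonstration, not the pattern used by any particular tool. It swallows the nested URL into the outer one and misses hosts that carry no scheme.

```python
import re

# Naive pattern: an http/https scheme followed by any run of non-whitespace.
NAIVE_URL = re.compile(r"https?://\S+")

line = ("http://www.xxx.cn/r/common/register_tpl_shortcut.php"
        "?ico_url=http://www.abcfdsf.com/tg_play_1121.php&supplier_id=3")

matches = NAIVE_URL.findall(line)
# The greedy \S+ consumes the whole line, so the nested URL is not
# reported as a separate match.
print(len(matches))

# A host without a scheme is not matched at all.
print(NAIVE_URL.findall("admin:@www.g.cn"))
```

More elaborate patterns can reduce these failures, but each new special case makes the expression harder to maintain.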

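URL Categories and Features

1. IP-form URLs

Examples: 192.168.1.1, 10.20.11.1
Structural features:
- Four decimal numbers, each in the range 0 to 255
- Separated by .
- May include a port number (e.g. :8080)

2. Domain-form URLs

Examples: baidu.com, www.sina.com, freebuf.com
Key feature:
- Contains a top-level domain (e.g. .com, .cn)

3. Mixed-form URLs

Example: 1234.com
Feature: a domain name that begins with digits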
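
The three structural features above can be checked on an already isolated string as follows. This is a minimal sketch; the function name parse_ip_host is hypothetical, and the upper bound of 65535 on the port is an added check not mentioned in the original text.

```python
def parse_ip_host(text):
    """Return (ip, port) for 'a.b.c.d' or 'a.b.c.d:port', else None."""
    host, sep, port = text.partition(":")
    parts = host.split(".")
    if len(parts) != 4:
        return None
    for part in parts:
        # isascii() rejects non-ASCII digits that isdigit() would accept.
        if not (part.isascii() and part.isdigit()) or int(part) > 255:
            return None
    if sep:
        # A ':' was present, so a valid numeric port must follow it.
        if not (port.isascii() and port.isdigit()) or int(port) > 65535:
            return None
        return host, int(port)
    return host, None


print(parse_ip_host("192.168.1.1"))
print(parse_ip_host("10.20.11.1:8080"))
print(parse_ip_host("256.1.1.1"))
```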
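
As a rough illustration of the three categories, the sketch below labels an isolated host string. The names host_form, is_octet and SAMPLE_TLDS are hypothetical, and the TLD set is a small illustrative subset rather than the full list used later in this article.

```python
SAMPLE_TLDS = {"com", "cn", "net", "org"}  # illustrative subset only


def is_octet(part):
    """True if part is an ASCII decimal number no greater than 255."""
    return part.isascii() and part.isdigit() and int(part) <= 255


def host_form(host):
    """Return 'ip', 'domain', 'mixed', or 'other' for a host string."""
    labels = host.lower().split(".")
    if len(labels) == 4 and all(is_octet(p) for p in labels):
        return "ip"
    # all(labels) rejects empty labels such as the one in ".com".
    if len(labels) >= 2 and all(labels) and labels[-1] in SAMPLE_TLDS:
        return "mixed" if labels[0][0].isdigit() else "domain"
    return "other"


for sample in ("192.168.1.1", "baidu.com", "1234.com"):
    print(sample, host_form(sample))
```

The mixed form matters because a scanner that sees a leading digit cannot yet tell whether it is reading an IP address or a domain name.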

Implementation Details

Legal character definitions

legalChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-_"
legalNumers = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
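
The extraction code later in this article calls an isLegalChar helper that is not shown. One plausible version is sketched below. It assumes that both letters and digits count as legal, which is consistent with results such as mp3.com, but the real helper may differ. The names is_legal_char and LEGAL are hypothetical.

```python
legalChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-_"
legalNumers = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

# A set gives constant-time lookup and avoids the substring matches that
# the "in" operator would perform against a plain string.
LEGAL = set(legalChars) | set(legalNumers)


def is_legal_char(ch):
    """True if ch is a single character allowed inside a host label."""
    return ch in LEGAL


print(is_legal_char("a"), is_legal_char("7"), is_legal_char("."))
```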

Top-level domain list

The list contains common top-level domains such as:

topLevelDomain = ['biz', 'com', 'edu', 'gov', 'info', 'int', 'mil', 'name', 'net', 
                 'org', 'pro', 'aero', 'cat', 'coop', 'jobs', 'museum', 'travel', 
                 'arpa', 'root', 'mobi', 'post', 'tel', 'asia', 'geo', 'kid', 
                 'mail', 'sco', 'web', 'xxx', 'nato', ...]

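Domain extraction algorithm

Check for legal characters
Look for the . separator
Validate the top-level domain
Handle an optional port number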

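# Excerpt from the tokenizer method. z is the remaining input; zv is assumed
# to be its first character. Both are defined outside this excerpt.
if self.isLegalChar(zv):
    i = 0
    reti = 0
    tokenType = TK_OTHER
    # Consume the first label (e.g. "www").
    while (i < len(z) and self.isLegalChar(z[i])):
        i = i + 1
        reti = i

    # Each '.' introduces another label.
    while i < len(z) and z[i] == '.':
        i = i + 1
        # Read only the next label; comparing the whole remainder of z
        # against the TLD list would fail whenever a path or port follows.
        j = i
        while j < len(z) and self.isLegalChar(z[j]):
            j = j + 1
        urltoken_str = z[i:j].lower()
        if urltoken_str in topLevelDomain:
            tokenType = TK_DOMAIN
        if j > i:
            i = j
            reti = i
        # Optional port: a ':' followed by at least one digit.
        if i + 1 < len(z) and z[i] == ':' and z[i + 1].isdigit():
            i = i + 1
            while (i < len(z) and z[i].isdigit()):
                i = i + 1
                reti = i
    if tokenType == TK_DOMAIN:
        check_url = z[0:i]
        # Drop the port before checking the suffix.
        if check_url.find(':') >= 0:
            check_url = check_url[0:check_url.find(':')]
        # Accept the candidate only if it ends with a known top-level domain.
        for item in topLevelDomain:
            pos = check_url.find('.' + item)
            if pos > -1 and (pos + len(item) + 1 == len(check_url)):
                self.urls.append(z[0:i])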
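
For readers who want something they can run in isolation, the four steps can also be sketched as a standalone function. This is a reimplementation under stated assumptions, not the author's code: match_domain, LEGAL, DIGITS and SAMPLE_TLDS are hypothetical names, digits are treated as legal label characters, and the TLD set is a small subset.

```python
import string

LEGAL = set(string.ascii_letters + string.digits + "-_")
DIGITS = set(string.digits)
SAMPLE_TLDS = {"com", "cn", "net", "org"}  # illustrative subset only


def match_domain(z):
    """Return the 'host[:port]' prefix of z if it ends in a known TLD."""
    i = 0
    labels = []
    while True:
        start = i
        while i < len(z) and z[i] in LEGAL:
            i += 1
        if i == start:
            return None  # z does not start with a legal character
        labels.append(z[start:i].lower())
        # Continue only when '.' is followed by another label character,
        # so a trailing '.' is left out of the match.
        if i + 1 < len(z) and z[i] == "." and z[i + 1] in LEGAL:
            i += 1
            continue
        break
    if len(labels) < 2 or labels[-1] not in SAMPLE_TLDS:
        return None
    end = i
    # Optional port: ':' followed by at least one digit.
    if i + 1 < len(z) and z[i] == ":" and z[i + 1] in DIGITS:
        i += 1
        while i < len(z) and z[i] in DIGITS:
            i += 1
        end = i
    return z[:end]


print(match_domain("www.g.cn/path"))
print(match_domain("test11.com:8090/file"))
```

Requiring the last label to be the top-level domain means a string such as www.com.evil is rejected rather than truncated to www.com.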

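IP address extraction algorithm

Check each numeric segment
Verify the . separators
Check the four-segment structure
Handle an optional port number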

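# First segment: consume a run of digits.
# Note: this excerpt does not range-check the value of each segment.
while (i < len(z) and z[i].isdigit()):
    i = i + 1
    ip_v1 = True
    reti = i
# The segment must be followed by a '.' separator.
if i < len(z) and z[i] == '.':
    i = i + 1
    reti = i
else:
    tokenType = TK_OTHER
    reti = 1

# The remaining three segments are handled the same way...

# Record the match only if all four segments were seen.
if ip_v1 and ip_v2 and ip_v3 and ip_v4:
    self.urls.append(z[0:i])
    return reti, tokenType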
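
Because the excerpt above elides three of the four segments, a compact standalone version may be easier to follow. The sketch below is a reimplementation, not the author's code. The name match_ip is hypothetical, and the range check on each segment is an addition that follows the structural features listed earlier.

```python
DIGITS = "0123456789"


def match_ip(z):
    """Return the 'a.b.c.d[:port]' prefix of z, or None if there is none."""
    i = 0
    for segment_index in range(4):
        start = i
        while i < len(z) and z[i] in DIGITS:
            i += 1
        # Each segment needs 1 to 3 digits and a value of at most 255.
        if i == start or i - start > 3 or int(z[start:i]) > 255:
            return None
        # The first three segments must each be followed by '.'.
        if segment_index < 3:
            if i >= len(z) or z[i] != ".":
                return None
            i += 1
    end = i
    # Optional port: ':' followed by at least one digit.
    if i + 1 < len(z) and z[i] == ":" and z[i + 1] in DIGITS:
        i += 1
        while i < len(z) and z[i] in DIGITS:
            i += 1
        end = i
    return z[:end]


print(match_ip("10.10.10.10:8080/?a=1"))
print(match_ip("1234.com"))
```

A mixed-form host such as 1234.com fails the segment check, so a caller can fall back to the domain matcher for it.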

Test Results

Test data:

192.168.1.1
mp3.com
http:www.g.cn
http:\www.g.cn
http:\/\www.g.cn
admin:@www.g.cn
http://10.10.10.10:8080/?a=1
file://test11.com:8090/file
mailto:majy@corp.com
username:password@g.cn

Output:

192.168.1.1 ['192.168.1.1']
mp3.com ['mp3.com']
http:www.g.cn ['www.g.cn']
http:\www.g.cn ['www.g.cn']
http:\/\www.g.cn ['www.g.cn']
admin:@www.g.cn ['www.g.cn']
http://10.10.10.10:8080/?a=1 ['10.10.10.10:8080']
file://test11.com:8090/file ['test11.com:8090']
mailto:majy@corp.com ['corp.com']
username:password@g.cn ['g.cn']

Getting the Code

The complete implementation is available on GitHub:
https://github.com/skskevin/UrlDetect/blob/master/tool/domainExtract/domainExtract.py

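Summary

Compared with regular expressions, the lexical-analysis approach described in this article:

Is more accurate
Handles a wide variety of complex URL formats
Is easy to extend (by editing the top-level domain list)

The approach is particularly well suited to security analysis, where it can accurately extract potentially malicious URLs from logs for further investigation.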