[Log Service][Data Transformation] Summary of using the e_anchor and e_regex functions

2022-03-19 09:29:09

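e_anchor extracts values by string index. Its strengths are that it is convenient, quick to write, and efficient. Its weaknesses are just as clear: it is not very flexible and does not generalize well, so it suits strings that have obvious prefix and suffix markers and a highly regular layout.

Regex extraction, as is well known, uses a regular expression to pull values out of a string. It is flexible, expressive, and powerful: most string-extraction problems can be solved this way, and a short expression can exercise complex control over a string. The drawbacks are equally evident. It performs somewhat worse than e_anchor, and it is hard to read for people who are new to it, which means a high learning cost, a steep start, and a poor fit for newcomers who need to solve their own problem quickly.

This article uses concrete examples to show the scenarios in which e_anchor and regular expressions each apply.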

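Parsing custom log text

The e_anchor function is suited to one or more text strings that are highly regular and carry obvious prefix markers, such as the following logs:

# Log 1: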
__source__:  1.1.16.15
__tag__:__client_ip__:  12.1.75.140
__tag__:__receive_time__:  1563443076
content: Aug 2 04:06:08: host=10.1.1.124: local/ssl2 notice mcpd[3772]: User=jsmith@demo.com: severity=warning: 01070638:5: Pool member 172.31.51.22:0 monitor status down. 
# Log 2:
__source__:  1.1.16.15
__tag__:__client_ip__:  12.1.75.140
__tag__:__receive_time__:  1563443077
content: Feb 5 3:15:09: host=1.1.1.1: local/ssl2 error mcpd[1222]: User=twiss@aliyun.com: severity=error: 01070639:6: Pool member 192.168.1.1 monitor status invalid.

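The log structure shows that every entry contains the same set of fields; host, mcpd, User, and severity are fixed fields of the raw log.

LOG DSL orchestration

This section provides two solutions for parsing the log text above.

Solution 1: parsing with e_anchor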

# Aug 2 04:06:08: host=10.1.1.124: local/ssl2 notice mcpd[3772]: User=jsmith@demo.com: severity=warning: 01070638:5: Pool member 172.31.51.22:0 monitor status down. 

e_anchor("content", "*: host=*: local/ssl2 * mcpd[*]: User=*: severity=*: * Pool member * monitor status *.", ["time", "host","level", "mcpd", "user_field","severity_field","*","pool_member", "monitor_status"])

Preview of the processed logs:

# Log 1
content: Aug 2 04:06:08: host=10.1.1.124: local/ssl2 notice mcpd[3772]: User=jsmith@demo.com: severity=warning: 01070638:5: Pool member 172.31.51.22:0 monitor status down.
time: Aug 2 04:06:08
host: 10.1.1.124
level: notice
mcpd: 3772
user_field: jsmith@demo.com
severity_field: warning
pool_member: 172.31.51.22:0
monitor_status: down

# Log 2
content: Feb 5 3:15:09: host=1.1.1.1: local/ssl2 error mcpd[1222]: User=twiss@aliyun.com: severity=error: 01070639:6: Pool member 192.168.1.1 monitor status invalid.
time: Feb 5 3:15:09
host: 1.1.1.1
level: error
mcpd: 1222
user_field: twiss@aliyun.com
severity_field: error
pool_member: 192.168.1.1
monitor_status: invalid

Solution 2: parsing with e_regex

e_regex("content",r'(?P<time>(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|ä)?r(?:ch|z)?|Apr(?:il)?|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?) (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]) (?:2[0123]|[01]?[0-9]):(?:[0-5][0-9]):(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)): host=(?P<host>(?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])): local\/ssl2 (?P<level>[a-z]+) mcpd\[(?P<mcpd>[0-9]+)\]: User=(?P<user_field>[a-zA-Z][a-zA-Z0-9_.+-=:]+@\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)): severity=(?P<severity_field>[a-z]+): (?P<temp_field>[0-9]+:[0-9]+:) Pool member (?P<pool_member>(?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9]):[0-9]+) [a-z]+ [a-z]+ (?P<monitor_status>[a-z]+).')

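For parsing with the grok function, see the grok function section of the data transformation documentation. For a comparison of regular expressions and grok, see the comparison of Nginx log parsing solutions: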

e_regex("content",grok('%{DATE_TIME:time}: host=%{IP:host}: local/ssl2 %{USERNAME:level} mcpd\[%{NUMBER:mcpd}\]: User=%{EMAILADDRESS:user_field}: severity=%{USERNAME:severity_field}: %{NUMBER}:%{NUMBER}: %{USERNAME} %{USERNAME} %{IP:pool_member}:%{NUMBER} %{USERNAME} %{USERNAME} %{USERNAME:monitor_status}.',extend={'DATE_TIME': '%{MONTH} %{MONTHDAY} %{TIME}'}))

Preview of the processed logs:

# Log 1
content: Aug 2 04:06:08: host=10.1.1.124: local/ssl2 notice mcpd[3772]: User=jsmith@demo.com: severity=warning: 01070638:5: Pool member 172.31.51.22:0 monitor status down.
time: Aug 2 04:06:08
host: 10.1.1.124
level: notice
mcpd: 3772
user_field: jsmith@demo.com
severity_field: warning
pool_member: 172.31.51.22:0
monitor_status: down

# Log 2
content: Feb 5 3:15:09: host=1.1.1.1: local/ssl2 error mcpd[1222]: User=twiss@aliyun.com: severity=error: 01070639:6: Pool member 192.168.1.1 monitor status invalid.
time: Feb 5 3:15:09
host: 1.1.1.1
level: error
mcpd: 1222
user_field: twiss@aliyun.com
severity_field: error
pool_member: 192.168.1.1
monitor_status: invalid

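The results show that anchor, in essence, extracts the varying values according to the boundaries that do not change. Some varying values may therefore be extracted without being needed; name those `*` in fields. For details, see the description of the fields parameter of e_anchor.
Looking at the rules themselves, the e_anchor rule in Solution 1 is easier to understand and pick up, and its syntax is simpler. The regex rule in Solution 2 is hard to read and error-prone, and it also processes more slowly than Solution 1.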
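
To make the idea of fixed boundaries and varying values concrete, here is a minimal Python sketch of anchor-style extraction. The function name anchor_extract and its behavior are assumptions made for illustration. It is not the e_anchor implementation in Log Service, only a stand-in that walks the string with str.find and drops any field named `*`.

```python
def anchor_extract(text, pattern, fields):
    """Hypothetical stand-in for anchor-style extraction (not the SLS code)."""
    # The literal pieces between the * wildcards are the fixed boundaries.
    literals = pattern.split("*")
    if len(literals) - 1 != len(fields):
        raise ValueError("one field name is needed per *")
    if "" in literals[1:-1]:
        raise ValueError("adjacent * have no boundary between them")
    if not text.startswith(literals[0]):
        return None
    pos = len(literals[0])
    result = {}
    for name, boundary in zip(fields, literals[1:]):
        if boundary == "":
            # Only a trailing * reaches here: it takes the rest of the text.
            end = len(text)
        else:
            # The value ends at the first occurrence of the next boundary.
            end = text.find(boundary, pos)
            if end == -1:
                return None
        if name != "*":
            result[name] = text[pos:end]
        pos = end + len(boundary)
    return result


PATTERN = ("*: host=*: local/ssl2 * mcpd[*]: User=*: severity=*: "
           "* Pool member * monitor status *.")
FIELDS = ["time", "host", "level", "mcpd", "user_field",
          "severity_field", "*", "pool_member", "monitor_status"]
content = (
    "Aug 2 04:06:08: host=10.1.1.124: local/ssl2 notice mcpd[3772]: "
    "User=jsmith@demo.com: severity=warning: 01070638:5: "
    "Pool member 172.31.51.22:0 monitor status down."
)
parsed = anchor_extract(content, PATTERN, FIELDS)
```

In this sketch the seventh wildcard swallows 01070638:5: and, because its field name is `*`, the value is discarded rather than stored.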

Comparison

The e_anchor function

This function is not very flexible. Suppose the logs above are changed to the following form:

# Log 1
content: Aug 2 04:06:08: 10.1.1.124: local/ssl2 notice mcpd[3772]: jsmith@demo.com: warning: 01070638:5: 172.31.51.22:0 down.
# Log 2
content: Feb 5 3:15:09: 1.1.1.1: local/ssl2 error mcpd[1222]: twiss@aliyun.com: error: 01070639:6: 192.168.1.1:0 invalid.

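Log text like this has no obvious common prefix or suffix markers (the time also contains `:`, which blurs the boundaries). It is hard for e_anchor to parse every content value; a rule works only for an individual content value and does not generalize.

The e_regex function

A regular expression can parse log text of this form, as follows: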

e_regex("content",r"(?P<time>\b(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|ä)?r(?:ch|z)?|Apr(?:il)?|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?) (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]) (?:2[0123]|[01]?[0-9]):(?:[0-5][0-9]):(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)): (?P<host>(?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])): local\/ssl2 (?P<level>[a-z]+) mcpd\[(?P<mcpd>[0-9]+)\]: (?P<user_field>[a-zA-Z][a-zA-Z0-9_.+-=:]+@\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)): (?P<severity_field>[a-z]+): (?P<temp_field>[0-9]+:[0-9]+:) (?P<pool_member>(?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9]):[0-9]+) (?P<monitor_status>[a-z]+).")

Preview of the processed logs:

# Log 1
content: Aug 2 04:06:08: 10.1.1.124: local/ssl2 notice mcpd[3772]: jsmith@demo.com: warning: 01070638:5: 172.31.51.22:0 down.
time: Aug 2 04:06:08
host: 10.1.1.124
level: notice
mcpd: 3772
user_field: jsmith@demo.com
severity_field: warning
pool_member: 172.31.51.22:0
monitor_status: down

# Log 2
content: Feb 5 3:15:09: 1.1.1.1: local/ssl2 error mcpd[1222]: twiss@aliyun.com: error: 01070639:6: 192.168.1.1:0 invalid.
time: Feb 5 3:15:09
host: 1.1.1.1
level: error
mcpd: 1222
user_field: twiss@aliyun.com
severity_field: error
pool_member: 192.168.1.1:0
monitor_status: invalid

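The parsed result is the same as before. In the rule above, the regular expression merely drops the earlier prefix markers; the rule changes very little and still parses every content value.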
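
For readers who want to try the idea outside Log Service, the following is a shorter Python sketch that uses the standard re module on the same prefix-less logs. The sub-patterns here are deliberately looser than the rule above (for example, any three-letter capitalized word is accepted as a month), and the name LOG_RE is an assumption of this example rather than part of the original rule.

```python
import re

# Simplified, looser pattern for the prefix-less log line (illustrative only).
LOG_RE = re.compile(
    r"(?P<time>[A-Z][a-z]{2} \d{1,2} \d{1,2}:\d{2}:\d{2}): "
    r"(?P<host>\d{1,3}(?:\.\d{1,3}){3}): "
    r"local/ssl2 (?P<level>[a-z]+) mcpd\[(?P<mcpd>\d+)\]: "
    r"(?P<user_field>[^@\s]+@[^:\s]+): "
    r"(?P<severity_field>[a-z]+): "
    r"\d+:\d+: "
    r"(?P<pool_member>\d{1,3}(?:\.\d{1,3}){3}:\d+) "
    r"(?P<monitor_status>[a-z]+)\."
)

log1 = ("Aug 2 04:06:08: 10.1.1.124: local/ssl2 notice mcpd[3772]: "
        "jsmith@demo.com: warning: 01070638:5: 172.31.51.22:0 down.")
log2 = ("Feb 5 3:15:09: 1.1.1.1: local/ssl2 error mcpd[1222]: "
        "twiss@aliyun.com: error: 01070639:6: 192.168.1.1:0 invalid.")

# Each named group becomes one extracted field.
fields1 = LOG_RE.match(log1).groupdict()
fields2 = LOG_RE.match(log2).groupdict()
```

Because each value is described by its own shape rather than by the text around it, the same pattern handles both lines even though the prefix markers are gone.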

Summary

1. Parsing text with common prefix and suffix markers

Logs

# Log 1
content: Time=10/Jun/2020:11:32:16 +0800; Host=m.zf.cn; Method=GET; Url=http://aliyun/zf/11874.html; 
# Log 2
content: Time=11/Feb/2020:12:22:10 +0800; Host=sls.aliyun.cn; Method=POST; Url=http://aliyun/sls/1235.html;

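The logs above all have common prefix markers, namely Time=, Host=, Method=, and Url=, and a common suffix marker, `;`, that is, each field ends with a semicolon. For logs of this form, e_anchor is the recommended way to parse the text.

Parsing rules

# e_anchor extraction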
e_anchor("content","Time=*; Host=*; Method=*; Url=*;",["time","host","method","url"])
# Regex extraction
e_regex("content",r"Time=(?P<time>\b(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])\/\b(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|ä)?r(?:ch|z)?|Apr(?:il)?|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?)\b\/(?:\d\d){1,2}:(?:2[0123]|[01]?[0-9]):(?:[0-5][0-9]):(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)\s\+[0-9]{4}); Host=(?P<host>\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)); Method=(?P<method>[a-zA-Z]+); Url=(?P<url>[a-zA-Z][a-zA-Z0-9_.+-=:]+);")
# e_kv extraction
e_kv("content",sep="=", quote='"')

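In terms of processing speed, e_kv has a short syntax but in essence also extracts with regular expressions, whereas e_anchor in essence parses by string index. The e_anchor function is therefore recommended.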
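
The two mechanisms can be placed side by side in a small Python sketch. The helper names by_index and by_regex are hypothetical, and neither function reflects how Log Service implements e_anchor or e_kv. The sketch only shows that an index walk and a regular expression reach the same fields on this log; it makes no claim about their relative speed.

```python
import re

CONTENT = ("Time=10/Jun/2020:11:32:16 +0800; Host=m.zf.cn; "
           "Method=GET; Url=http://aliyun/zf/11874.html;")

# (field name, literal prefix) pairs, in the order they appear in the log.
PREFIXES = [("time", "Time="), ("host", "Host="),
            ("method", "Method="), ("url", "Url=")]

FIELD_RE = re.compile(
    r"Time=(?P<time>[^;]*); Host=(?P<host>[^;]*); "
    r"Method=(?P<method>[^;]*); Url=(?P<url>[^;]*);"
)


def by_index(text):
    """Index-based walk: find each prefix, then cut at the next ';'."""
    result, pos = {}, 0
    for name, prefix in PREFIXES:
        start = text.find(prefix, pos)
        if start == -1:
            return None
        start += len(prefix)
        end = text.find(";", start)
        if end == -1:
            return None
        result[name] = text[start:end]
        pos = end + 1
    return result


def by_regex(text):
    """Regex-based extraction with one named group per field."""
    match = FIELD_RE.match(text)
    return match.groupdict() if match else None


index_result = by_index(CONTENT)
regex_result = by_regex(CONTENT)
```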

2. Parsing text that mixes fields with and without common prefix (suffix) markers

Logs

# Log 1
content: 10/Jun/2020:11:32:16 +0800; m.zf.cn; Method=GET; Url=http://aliyun/zf/11874.html; 
# Log 2
content: 11/Feb/2020:12:22:10 +0800; sls.aliyun.cn; Method=POST; Url=http://aliyun/sls/1235.html;

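In the logs above, Method= and Url= are common prefix markers and `;` is the common suffix marker, that is, each field ends with a semicolon. The time and hostname values, however, have no obvious prefix marker. Logs of this form can also be parsed with e_anchor.

Parsing rules

# e_anchor extraction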
e_anchor("content","*; *; Method=*; Url=*;",["time","host","method","url"])
# Regex extraction
e_regex("content",r"(?P<time>\b(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])\/\b(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|ä)?r(?:ch|z)?|Apr(?:il)?|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?)\b\/(?:\d\d){1,2}:(?:2[0123]|[01]?[0-9]):(?:[0-5][0-9]):(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)\s\+[0-9]{4}); (?P<host>\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)); Method=(?P<method>[a-zA-Z]+); Url=(?P<url>[a-zA-Z][a-zA-Z0-9_.+-=:]+);")

e_kv cannot fully parse logs of this type, so the e_anchor function is recommended for this form.
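
Why key-value extraction falls short here can be seen in a rough Python sketch. The pattern KV_RE and the helper kv_pairs are assumptions that stand in for a generic key=value scan; they do not reproduce the actual behavior of e_kv. The point is only that values without a key= prefix are invisible to such a scan, while splitting on the shared `;` boundary still reaches them.

```python
import re

# Generic key=value scan: a word, '=', then everything up to the next ';'.
KV_RE = re.compile(r"(\w+)=([^;]*);")


def kv_pairs(text):
    """Return only the fields that carry an explicit key= prefix."""
    return dict(KV_RE.findall(text))


mixed = ("10/Jun/2020:11:32:16 +0800; m.zf.cn; "
         "Method=GET; Url=http://aliyun/zf/11874.html;")

pairs = kv_pairs(mixed)

# Splitting on the common ';' suffix recovers every segment by position.
segments = [part.strip() for part in mixed.split(";") if part.strip()]
time_value, host_value = segments[0], segments[1]
```

The key-value scan finds Method and Url but has nothing to name the first two segments with, which is where a positional rule such as the e_anchor pattern above helps.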

3. Parsing text without obvious common prefix or suffix markers

Logs

# Log 1
content: Aug 10: 11:12:03: Twiss Programmer Logon GUID is a unique identifier that can be used to correlate this event with a KDC event.
# Log 2
content: Feb 11: 10:00:00: Iran VC This will be 0 if no session key was requested.

These logs have no obvious prefix or suffix markers.

Parsing rules

e_regex("content","(?P<time>(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|ä)?r(?:ch|z)?|Apr(?:il)?|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?) (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]): (?:2[0123]|[01]?[0-9]):(?:[0-5][0-9]):(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)): (?P<name>[a-zA-Z]+) (?P<job>[a-zA-Z]+) (?P<msg>(.*)$)")

Parsing this with an e_anchor rule is rather verbose, because month, day, and time have to be extracted separately, for example:

e_anchor("content","* *: *: * * *","month,day,time,user,job,msg")

If the date and time are extracted as a whole, for example:

e_anchor("content","*: * * *","time,user,job,msg")
# The extracted log is
"""
content : Feb 11: 10:00:00: Iran VC This will be 0 if no session key was requested.
time : Feb 11
user: 10:00:00:
job : Iran
msg : VC This will be 0 if no session key was requested.
"""

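The log that e_anchor produces when the date and time are treated as a whole is clearly wrong. Because the prefix and suffix markers in this type of log are not obvious, the e_regex function is recommended.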
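
The reason for the wrong result is that each wildcard stops at the first occurrence of the next boundary. A short Python sketch with str.split reproduces both rules step by step. The function names whole_datetime and separate_parts are made up for this illustration, and splitting is only an approximation of how an anchor rule is evaluated.

```python
content = ("Feb 11: 10:00:00: Iran VC This will be 0 "
           "if no session key was requested.")


def whole_datetime(text):
    """Mirrors the rule "*: * * *": the first ': ' already ends the time."""
    time, rest = text.split(": ", 1)
    user, job, msg = rest.split(" ", 2)
    return {"time": time, "user": user, "job": job, "msg": msg}


def separate_parts(text):
    """Mirrors the rule "* *: *: * * *": month, day and time are split out."""
    month, rest = text.split(" ", 1)
    day, rest = rest.split(": ", 1)
    time, rest = rest.split(": ", 1)
    user, job, msg = rest.split(" ", 2)
    return {"month": month, "day": day, "time": time,
            "user": user, "job": job, "msg": msg}


wrong = whole_datetime(content)
right = separate_parts(content)
```

In the first function the boundary ': ' is found right after Feb 11, so the clock time slides into the user field and every later field shifts by one.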

4. Text that uses the special character `*` as a prefix or suffix marker

Logs

# Log 1
content: Aug 10 11:12:03* Twiss* Programmer;
# Log 2
content: Feb 11 10:00:00* Iran* VC;

This type of log has no obvious prefix marker but uses an obvious `*` as the suffix marker.

Parsing rules

Regex parsing:

e_regex("content",r"(?P<time>(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|ä)?r(?:ch|z)?|Apr(?:il)?|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?) (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]) (?:2[0123]|[01]?[0-9]):(?:[0-5][0-9]):(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))\* (?P<user>[a-zA-Z]+)\* (?P<job>[a-zA-Z]+);")

The following e_anchor rule is wrong (it is an example of what not to do):

e_anchor("content","** **; *;",["time","user","job"])

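The extraction rule in e_anchor does not support the form `**`.
Text of this type, which uses the special character `*` as an obvious prefix or suffix marker, suits the e_regex function and does not suit the e_anchor function.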
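
In a regular expression the literal `*` only needs to be escaped. The following Python sketch uses the standard re module with a simplified pattern; the name STAR_RE and the loose sub-patterns are assumptions of this example, not the rule shown above.

```python
import re

# "\*" matches a literal asterisk; "[^*]+" matches everything up to it.
STAR_RE = re.compile(r"(?P<time>[^*]+)\* (?P<user>[^*]+)\* (?P<job>[^;]+);")

log1 = "Aug 10 11:12:03* Twiss* Programmer;"
log2 = "Feb 11 10:00:00* Iran* VC;"

fields1 = STAR_RE.match(log1).groupdict()
fields2 = STAR_RE.match(log2).groupdict()

# re.escape produces the escaped form when the marker comes from a variable.
escaped_marker = re.escape("*")
```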

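5. Applicable scenarios for the e_anchor and e_regex functions

Scenario	e_anchor function	e_regex function
Parsing text with common prefix and suffix markers	Suitable (recommended)	Suitable
Parsing text that mixes fields with and without common prefix (suffix) markers	Suitable (recommended)	Suitable
Parsing text without obvious common prefix or suffix markers	Not suitable	Suitable (recommended)
Text that uses the special character `*` as a prefix or suffix marker	Not suitable	Suitable (recommended)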
