Flume使用入门

2024-02-24 10:28:55

记录一下日志采集框架flume的相关内容，flume是由Cloudera开发，后面贡献给了Apache，是一个分布式的、稳定的，用于日志采集、汇聚和传输的系统，现在用的一般是1.x版本，老版本的因为用得少暂时不考虑。

基本概念

包括agent和event。

Agent

以下是数据流模型图，source+channel+sink组成一个agent，一个agent完成从数据源获取数据，到写出数据到目的地，它本质上是一个java进程。

（1）source，source不是数据采集的发源地，是用于对接数据源的，它是从发源地如web server采集源数据，或者从上一个agent sink的结果采集数据，可以理解为水管。

（2）channel，是一个数据临时存储区域，当数据在sink被消费后，才从channel删除，这样可以保证数据传输的安全和可靠性，可以理解成一个蓄水水箱。

（3）sink，将数据发给目的地，可以是hdfs，也可以是下一个agent采集数据的发源地等，可以理解为水轮头。

Event

Event是flume中传输的基本单位，一条消息会被封装成event对象，它本质上是一个json字符串，携带headers、body和消息体，消息是放在body里的，ETL写自定义拦截器时需要跟这两个打交道。

Flume复杂流动

上图是一个单级流动，除了单级流动flume还支持多级流动、扇入流动和扇出流动。

多级流动

如图是两个agent串联的多级流动，上一级的sink和下一级的source均是avro类型，上一级配置sink指向下一级souce的hostname和port。

扇入流

如图是扇入类型，官方文档有介绍一种应用场景，即几百个web服务器生产的日志，都汇入到十几个附属于存储子系统上的agent上。上图也可以配置完成，在agent1、agent2和agent3的sink中，指向agent4中source的hostname和port，最后agent4经过一个channel，将event统一写出到hdfs。

扇出流

如图是扇出类型，可以复制或多路输出，如果是复制则一个event将进入到三个channel，如果是多路输出，则event只会进入到符合条件的channel。这些都是通过agent中来配置的。

各大组件配置使用

三大组件包括source、channel和sink，每个组件根据业务场景的不同，会有多种type供选择，如何配置它们可以参考官方文档http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html，注意版本，新版本会添加一些功能，如type为taildir的source，就是1.7更新后添加，1.7版本也是生产环境比较稳定的版本并且和1.x的老版本兼容。

接下来就是写配置文件了，不同的场景配置文件大同小异，主体结构是一样的，都是配置三大组件，并连接source和channel，channel和sink。

source

souce根据type的不同有多种选择，包括netcat、avro、thrift、exec、jms、spooling、directory、sequence generator、syslog、http、legacy、自定义等。

（1）netcat source

netcat用于监听一个指定的端口的tcp请求数据，配置如下。

# 配置Agent，需要给Agent指定一个名字，取名a1
# 需要绑定Source，并给Source取名
a1.sources=s1
# 绑定channel，并且给Channel取名
a1.channels=c1
# 绑定Sink，并且给Sink取名
a1.sinks=k1

# 配置netcat source
a1.sources.s1.type=netcat
a1.sources.s1.bind=hadoop01
a1.sources.s1.port=8090

# 配置Channel 暂时用memory
a1.channels.c1.type=memory
# 内存空间有限，限定最大10000条event
a1.channels.c1.capacity=10000
# 批次sink的数量是100条
a1.channels.c1.transactionCapacity=100

# 配置sink，日志的形式打印
a1.sinks.k1.type=logger

# 将Source和Channel进行绑定
a1.sources.s1.channels=c1

# 将Channel和Sink进行绑定
a1.sinks.k1.channel=c1

配置好后，需启动agent，并加载上面的配置文件，

# 配置文件写在data目录下，启动flume
# agent用于启动一个agent a1 
# -c 用于加载原生配置
# -f 用户加载指定配置
# info,console代表控制台打印日志
[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# ../bin/flume-ng agent -n a1 -c ../conf/ -f ./basic.conf -Dflume.root.logger=info,console

...略

# 另外开启一个窗口，通过nc访问
[root@hadoop01 ~]# nc hadoop01 8090
hello buddy
OK

# 原窗口监听到nc
2020-01-21 10:03:38,709 (lifecycleSupervisor-1-3) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:164)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/192.168.200.140:8090]
# 接收到日志信息，可以看出event由headers、body和消息体组成
2020-01-21 10:05:04,970 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 65 6C 6C 6F 20 62 75 64 64 79                hello buddy }

（2）avro source

avro source接收的是avro序列化后的数据，然后反序列化后继续传输，可以实现以上三种复杂流动。接下来使用avro client发送数据，测试接收。

# 配置agent
a1.sources=s1
a1.channels=c1
a1.sinks=k1

# 配置avro source
a1.sources.s1.type=avro
# 监听本机
a1.sources.s1.bind=0.0.0.0
a1.sources.s1.port=8090

# 配置channel，使用memory类型
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100

# 配置sink，使用logger打印日志
a1.sinks.k1.type=logger

# 连接channel和sink、source
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

配置好后，先启动flume agent，加载上面的文件。

[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# ../bin/flume-ng agent -n a1 -c ../conf/ -f ./avro.conf -Dflume.root.logger=info,console

然后另开一个窗口启动avro client，并测试能否接收到序列化后的数据。数据需要提前准备好。

# avrotest.txt是提前准备好的文件
[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# ../bin/flume-ng avro-client -H 0.0.0.0 -p 8090 -F ../testdata/avrotest.txt -c ../conf/

查看启动了agent的控制台，发现接收到了反序列化后的数据。

（3） exec source

exec source接收的是linux命令的结果，如cat、tail和ls等命令，并将结果可以输出。

# 配置agent
a1.sources=s1
a1.channels=c1
a1.sinks=k1

# 配置exec source，这两个参考flume官方文档来配，是必选项
a1.sources.s1.type=exec
a1.sources.s1.command=ls /

# 配置channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

# 配置sink
a1.sinks.k1.type=logger

# 连接source channel和sink
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

配置好后，先启动flume agent，加载上面的文件。

[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# ../bin/flume-ng agent -n a1 -c ../conf/ -f ./exec.conf -Dflume.root.logger=info,console

执行完后可以看出输出了'ls /'的结果。

（4）sequence generator source

这个source主要用于测试，可以产生不断自增的序列数字，从0起步。如果数据流目的地最终能收到序列数字就说明能打通。

# 配置agent
a1.sources=s1
a1.channels=c1
a1.sinks=k1

# 配置sequence generator source，必选项只有一个
a1.sources.s1.type=seq

# 配置channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

# 配置sink
a1.sinks.k1.type=logger

# 连接source、channel和sink
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

配置好后，先启动flume agent，加载上面的文件。

[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# ../bin/flume-ng agent -n a1 -c ../conf/ -f ./seqsouce.conf -Dflume.root.logger=info,console

启动后，开始快速产生序列，打印到控制台。

（5）spooling directory source

spooling directory source比较常用，它可以监视某个目录下的新文件，并写入到channel，写完后的文件会重命名添加complete标签到文件名。另外建议文件名字不要重复，可以防止报错。

# 配置agent
a1.sources=s1
a1.channels=c1
a1.sinks=k1

# 配置spooling directory source，需要指定监视的目录/home/flumedata
a1.sources.s1.type=spooldir
a1.sources.s1.spoolDir=/home/flumedata

# 配置channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

# 配置sink
a1.sinks.k1.type=logger

# 连接source、channel和sink
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

配置好后，先启动flume agent，加载上面的文件。

[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# ../bin/flume-ng agent -n a1 -c ../conf/ -f ./spool.conf -Dflume.root.logger=info,console

给指定目录扔一个文件。

[root@hadoop01 /home]# mv book.txt ./flumedata/

查看agent的控制台，发现获取到了新增的内容。

（6）HTTP source

http source 可以接受http post和get请求，其中get请求不稳定只能用在测试，另外考虑到get请求一次只能携带64KB的数据如果是图片上传也不适用，因此一般处理post请求。http请求通过一个可插拔的'handler'(需实现HTTPSourceHandler接口)可以转换为event数组。另外参考官方文档，所有的这些转换后的event会在一次事务中提交到channel，可以提高file channel的效率。如果使用handler出现异常，会抛出http异常状态码。

可插拔handler有JSONHandler和BlobHandler。

6-a.JSONHandler可以处理json格式的event，支持UTF-8、UTF-16和UTF-32字符集，它将接受一个event数组，将其转换为flume event，转换的编码基于request请求中指定的字符集，如果没指定就默认是UTF-8，json格式的event如下所示，是一个json数组，每个json包含headers和body。

[{
  "headers" : {
             "timestamp" : "434324343",
             "host" : "random_host.example.com"
             },
  "body" : "random_body"
  },
  {
  "headers" : {
             "namenode" : "namenode.example.com",
             "datanode" : "random_datanode.example.com"
             },
  "body" : "really_random_body"
  }]

6-b.BlobHandler一般用于上传的请求，如上传PDF或JPG文件，它可以将其转换为包含请求参数和BLOB（binary large object）的event。但是这种处理器不宜处理非常大的对象因为内存大小有限，内存一次性储存一整个BLOB将可能空间不够。

下面简单的配置一个http source。

# 配置一个agent
a1.sources=s1
a1.channels=c1
a1.sinks=k1

# 配置http source
a1.sources.s1.type=http
a1.sources.s1.port=8848

# 配置channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

# 配置sink
a1.sinks.k1.type=logger

# 连接source、channel和sink
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

配置好后，先启动flume agent，加载上面的文件。

[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# ../bin/flume-ng agent -n a1 -c ../conf/ -f ./httpsource.conf -Dflume.root.logger=info,console

另开一个窗口，使用curl来发送post请求，也可以使用火狐浏览器发送post请求。

# -X指定请求类型
# -d 指定JSON格式数据
[root@hadoop01 /home]# curl -X POST -d '[{"headers":{"year":"2019"},"body":"hello my brother"},{"headers":{"year":"2020"},"body":"wedding,money and bigdata"}]' http://0.0.0.0:8848

agent控制台有接收到http请求的数据，并转换为event格式。

channel

channel也有多种选择，一般有memory、file、jdbc和内存溢出channel。

（1）memory channel

这种channel前面用的都是它，特点就是速度快，但是会丢数据，适用于可以丢失数据的场景。

一般一条event日志大小1-2kb，生产环境下，capacity一般配置10万-30万，这样占用大小是超过1G，而transactionCapacity一般配置1000-3000。

# 指定type必须为memory
a1.channels.c1.type=memory
# channel中能存储的event的最大数量
a1.channels.c1.capacity=10000
# channel一次事务中从source获取或者向sink发送的event数量
a1.channels.c1.transactionCapacity=100

（2）file channel

这种channel的特点是速度会慢，但是可靠性有保证，数据不易丢失。

a1.sources=s1
a1.channels=c1
a1.sinks=k1

a1.sources.s1.type=netcat
a1.sources.s1.bind=hadoop01
a1.sources.s1.port=8088

# channel类型必须为file
a1.channels.c1.type=file
# 配置缓存文件目录
a1.channels.c1.dataDirs=/home/check_dir
# 配置检查点文件目录
a1.channels.c1.checkpointDir=/home/checkpoint_dir

a1.sinks.k1.type=logger

a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

配置好后，先启动flume agent，加载上面的文件。

[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# ../bin/flume-ng agent -n a1 -c ../conf/ -f ./filechannel.conf -Dflume.root.logger=info,console

另外开启一个窗口，发送tcp数据到配置的ip和端口。

[root@hadoop01 /home]# nc hadoop01 8088
messi herry ronald

agent上会打印日志，只是速度会明显慢于memory channel。

flume会自动生成配置的文件夹。

刚才tcp数据，会写入到check_dir目录下的log-1文件里。

（3）jdbc channel

这个channel，从flume1.3版本开始没啥大的改动，采用derby为缓存，限于derby数据库的劣势如单线程、产生大量文件夹、局部查询的特点，比较鸡肋，可以不管。

# channel指定类型为jdbc
a1.channels.c1.type = jdbc

（4）内存溢出channel

这个channel有点类似shuffle中的环形缓冲区的意思，目前2020年1月份官方提示还是处在实验阶段，可以尝鲜，生产不宜使用。

# 配置内存溢出channel
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 10000
a1.channels.c1.overflowCapacity = 1000000
a1.channels.c1.byteCapacity = 800000
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data

sink

sink根据目的地的不同，可以选择logger、file roll、hdfs、avro、thrift、ipc、file、null、hbase、solr、kafka、自定义等。

（1）logger sink

前面的sink都是这种类型，打印的info级别的日志到控制台，便于调试。从上面的日志结果来看，它的展示内容不完整，因为默认只展示16个字节数据，可以配置。

# 配置sink logger
a1.sinks.k1.type=logger
# 配置展示log的字节数，默认16个
a1.sinks.k1.maxBytesToLog=20

（2）file roll sink

这个sink会每隔一段时间将采集的日志写入到文件，可以用于文件合并。

# 配置agent
a1.sources=s1
a1.channels=c1
a1.sinks=k1

# 配置source
a1.sources.s1.type=netcat
a1.sources.s1.port=8088
a1.sources.s1.bind=hadoop01

# 配置memory
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

# 配置sink为file roll
a1.sinks.k1.type=file_roll
# 数据储存目录
a1.sinks.k1.sink.directory=/home/fileroll_sink
# 采集数据生成新文件的频率，默认30，如果设置0将只写入到一个文件中
a1.sinks.k1.sink.rollInterval=3600

a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

配置好后，先启动flume agent，加载上面的文件。

[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# ../bin/flume-ng agent -n a1 -c ../conf/ -f ./filerollsink.conf -Dflume.root.logger=info,console

这个需要提前新建一个文件夹。另外开启一个窗口发送tcp数据后，会将数据写入到文件夹目录下。

# 发送tcp数据
[root@hadoop01 /home/fileroll_sink]# nc hadoop01 8088
hello buddy
OK
# 产生日志文件
[root@hadoop01 /home/fileroll_sink]# ll
total 4
-rw-r--r--. 1 root root 12 Jan 21 16:31 1579595421201-1
# 查看，内容就是tcp发送的数据
[root@hadoop01 /home/fileroll_sink]# cat 1579595421201-1
hello buddy

（3）hdfs sink

这个sink是将数据写入到hdfs，实际较为常用。目前支持text和序列化文件格式并且可使用压缩算法，能通过设置rollInterval、rollSize和rollCount的值，自定义hdfs关闭和创建新文件的周期。flume的数据sink保存到hdfs，支持分区或分桶的方式保存文件，这个比较常见的就是根据时间戳来分区保存文件。如果保存文件的文件名需要用日期来命名区分，这个时候可以使用转义字符（可参考官方文档），sink到hdfs后目录或文件名会自动根据转义字符来替换名字，常见的就是根据日期命名了，如'%Y-%m-%d'就是按照年月日命名。使用flume sink到hdfs需要安装hdfs，flume需要用到hadoop的相关jar包来和hdfs集群来通信。

以下是一个简单的hdfs sink的示例。

# 配置agent
a1.sources=s1
a1.channels=c1
a1.sinks=k1

# 配置source，为测试方便使用netcat
a1.sources.s1.type=netcat
a1.sources.s1.port=8848
a1.sources.s1.bind=hadoop01

# 配置sink
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

# 配置hdfs sink
a1.sinks.k1.type=hdfs
# 配置文件保存地址
a1.sinks.k1.hdfs.path=hdfs://hadoop01:9000/flume
# 文件滚动生成的时间间隔
a1.sinks.k1.hdfs.rollInterval=3600
# 默认为sequenceFile，文件类型设置为DataSteam，输出不支持压缩。如果要支持压缩使用CompressedStream
a1.sinks.k1.hdfs.fileType=DataStream

# 连接source、channel和sink
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

配置好后，先启动flume agent，加载上面的文件。

[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# ../bin/flume-ng agent -n a1 -c ../conf/ -f ./hdfssink.conf -Dflume.root.logger=info,console

另外开启一个窗口，发送tcp数据。

# 发送数据
[root@hadoop01 /home/fileroll_sink]# nc hadoop01 8848
buddy this is hdfs sink
OK

agent控制台可以看到sink成功。

查看文件内容，发现已写入文件，并且有一个.tmp后缀，因为文件正在in use，当文件关闭后后缀将移除。

# 查看内容
[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# hadoop fs -cat /flume/FlumeData.1579607270857
20/01/21 19:50:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
# 就是发送的tcp数据
buddy this is hdfs sink

（4）avro

avro sink是实现复杂数据流的基础，上一个agent发送avro序列化数据后，下一个agent通过source再接收avro序列化数据，再反序列化后继续传输。

4-1.实现多级流动

按照下面配置，可以实现多级流动，最后将文件写入到hdfs。

agent1在hadoop1上设置，设置如下。

# 多级流动
a1.sources=s1
a1.channels=c1
a1.sinks=k1

a1.sources.s1.type=netcat
a1.sources.s1.port=8090
a1.sources.s1.bind=hadoop01

a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

a1.sinks.k1.type=avro
a1.sinks.k1.hostname=hadoop02
a1.sinks.k1.port=8888

a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

View Code

agent2在hadoop2上设置，设置如下。

# 多级流动
a1.sources=s1
a1.channels=c1
a1.sinks=k1

a1.sources.s1.type=avro
a1.sources.s1.port=8888
a1.sources.s1.bind=hadoop02

a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

a1.sinks.k1.type=avro
a1.sinks.k1.hostname=hadoop03
a1.sinks.k1.port=8888

a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

View Code

agent3在hadoop3上设置，设置如下。

# 多级流动
a1.sources=s1
a1.channels=c1
a1.sinks=k1

a1.sources.s1.type=avro
a1.sources.s1.port=8888
a1.sources.s1.bind=hadoop03

a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://hadoop01:9000/flume
a1.sinks.k1.hdfs.rollInterval=3600
a1.sinks.k1.hdfs.fileType=DataStream

a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

View Code

按照hadoop03、hadoop02和hadoop01的顺序启动agent，然后在hadoop01使用nc来发送一条tcp数据，测试hdfs是否有写入。

# 发送的数据
[root@hadoop01 /home/fileroll_sink]# nc hadoop01 8090
this is multiflow test by avro
OK

检查发现hdfs有写入成功，并且内容ok。

[root@hadoop01 /home/fileroll_sink]# hadoop fs -ls /flume
20/01/21 20:25:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 root supergroup         24 2020-01-21 19:49 /flume/FlumeData.1579607270857
# 写入成功
-rw-r--r--   3 root supergroup         31 2020-01-21 20:25 /flume/FlumeData.1579616023729.tmp
# 查看发现写入tcp数据内容
[root@hadoop01 /home/fileroll_sink]# hadoop fs -cat /flume/FlumeData.1579616023729.tmp
20/01/21 20:26:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
this is multiflow test by avro

4-2.实现扇出

按照下面配置，可以实现一个扇出案例。

agent1在hadoop1上设置，设置如下，使用了两种不同的channel。

# 扇出流动
a1.sources=s1
a1.channels=c1 c2
a1.sinks=k1 k2

a1.sources.s1.type=netcat
a1.sources.s1.port=8888
a1.sources.s1.bind=hadoop01

# c1
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100
# c2
a1.channels.c2.type=file
a1.channels.c2.dataDirs=/home/flumedata

# k1
a1.sinks.k1.type=avro
a1.sinks.k1.hostname=hadoop02
a1.sinks.k1.port=8888
# k2
a1.sinks.k2.type=avro
a1.sinks.k2.hostname=hadoop03
a1.sinks.k2.port=8888

# c1
a1.sources.s1.channels=c1 c2
a1.sinks.k1.channel=c1
a1.sinks.k2.channel=c2

View Code

agent2在hadoop2上设置，设置如下。

# 扇出流动
a1.sources=s1
a1.channels=c1
a1.sinks=k1

a1.sources.s1.type=avro
a1.sources.s1.port=8888
a1.sources.s1.bind=hadoop02

# c1
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

# k1
a1.sinks.k1.type=logger

# c1
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

View Code

agent3在hadoop3上设置，设置如下。

# 扇出流动
a1.sources=s1
a1.channels=c1
a1.sinks=k1

a1.sources.s1.type=avro
a1.sources.s1.port=8888
a1.sources.s1.bind=hadoop03

# c1
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

# k1
a1.sinks.k1.type=logger

# c1
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

View Code

按照hadoop03、hadoop02和hadoop01的顺序启动agent，然后在hadoop01使用nc来发送一条tcp数据，测试hadoop02和hadoop03控制台是否有打印。

# 发送tcp数据
[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# nc hadoop01 8888
this is fan out flow test by avro
OK

发送完成后，hadoop02和hadoop03的控制台都有输出。

4-3.实现扇入

按照下面配置，可以实现一个扇入案例。

agent1在hadoop1上设置。

# 扇入流动
a1.sources=s1
a1.channels=c1
a1.sinks=k1

a1.sources.s1.type=netcat
a1.sources.s1.port=8888
a1.sources.s1.bind=0.0.0.0

a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

a1.sinks.k1.type=avro
a1.sinks.k1.hostname=hadoop03
a1.sinks.k1.port=8888

a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

View Code

agent2在hadoop2上设置，和hadoop01一样。

agent3在hadoop3上设置。

# 扇入流动
a1.sources=s1
a1.channels=c1
a1.sinks=k1

a1.sources.s1.type=avro
a1.sources.s1.port=8888
a1.sources.s1.bind=hadoop03

a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

a1.sinks.k1.type=logger

a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

View Code

按照hadoop03、hadoop02和hadoop01的顺序启动agent，然后hadoop01和hadoop02节点发送tcp数据包，测试hadoop03能否接收到。

# hadoop01发送tcp数据
[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# nc hadoop01 8888
from 01 node
OK
# hadoop02发送tcp数据
[root@hadoop02 /home/software/apache-flume-1.6.0-bin/data]# nc hadoop02 8888
from 02 node
OK

hadoop01和hadoop02发送tcp数据后，hadoop03都接收到了。

souce selector

选择器selector是souce的子组件，要理解它可以从上面的扇出来理解。

当不设置时，选择器默认是replicating模式，即复制模式，source获取到数据后会复制一份，然后发送到两个节点。

如果要根据event消息头的不同分发到不同的节点，需要将选择器设置为multiplexing，即路由模式，并在配置文件中指定区分的基准。

以下是使用了选择器的souce，修改上面扇出类型的agent1的配置。

# 扇出流动
a1.sources=s1
a1.channels=c1 c2
a1.sinks=k1 k2

# source使用http，好模拟
a1.sources.s1.type=http
a1.sources.s1.port=8888
# 配置选择器，选择类型为多路复用模式，即路由模式
a1.sources.s1.selector.type=multiplexing
# 选择器识别的消息头
a1.sources.s1.selector.header=name
# 消息头为messi的发给c1 channel
# 消息头为ronald的发给c2 channel
a1.sources.s1.selector.mapping.messi=c1
a1.sources.s1.selector.mapping.ronald=c2
# 没有匹配到的发送c2
# a1.sources.s1.selector.default=c2

# c1
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100
# c2
a1.channels.c2.type=file
a1.channels.c2.dataDirs=/home/flumedata

# sink1
a1.sinks.k1.type=avro
a1.sinks.k1.hostname=hadoop02
a1.sinks.k1.port=8888
# sink2
a1.sinks.k2.type=avro
a1.sinks.k2.hostname=hadoop03
a1.sinks.k2.port=8888

# 连接source、channel和sink，有几个sink就有几个channel
a1.sources.s1.channels=c1 c2
a1.sinks.k1.channel=c1
a1.sinks.k2.channel=c2

参考扇出，按照hadoop03、hadoop02和hadoop01的顺序启动agent，只是hadoop01加载的是新的配置文件。启动后hadoop01上curl模拟发送post请求，分别携带不同的请求头信息，测试是否发送不同的节点。

# 启动后发送post请求
[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# curl -X POST -d '[{"headers":{"name":"messi"},"body":"i am messi"}]' http://hadoop01:8888
[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# curl -X POST -d '[{"headers":{"name":"ronald"},"body":"i am ronald"}]' http://hadoop01:8888

可以看到，两个post请求分别发送到了不同的节点。

souce interceptor

拦截器也是source的子组件，它可以对数据进行改变。通过拦截器，可以根据自己的需求来过滤数据，并且拦截器也可以构成拦截器链，类似spring中的拦截器，也是责任链模式。

（1）timestamp interceptor

配置时间戳拦截器，event消息头里将包含毫秒为单位的时间戳，以下是配置示例。

a1.sources=s1
a1.channels=c1
a1.sinks=k1

a1.sources.s1.type=netcat
a1.sources.s1.bind=0.0.0.0
a1.sources.s1.port=8090
# 配置时间戳拦截器，可以配置多个，空格隔开
a1.sources.s1.interceptors=i1
a1.sources.s1.interceptors.i1.type=timestamp

a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

a1.sinks.k1.type=logger

a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

启动agent后，发送tcp数据包，发现event消息头里有时间。

上面hdfs sink有提到可以按照日期来命名文件或文件夹，这个可以配合拦截器，实现按天保存数据。

a1.sources=s1
a1.channels=c1
a1.sinks=k1

a1.sources.s1.type=netcat
a1.sources.s1.bind=0.0.0.0
a1.sources.s1.port=8090
# 配置时间戳拦截器
a1.sources.s1.interceptors=i1
a1.sources.s1.interceptors.i1.type=timestamp

a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

a1.sinks.k1.type=hdfs
# 按天输出到文件夹，使用转义字符
a1.sinks.k1.hdfs.path=hdfs://hadoop01:9000/flume/date=%Y-%m-%d
a1.sinks.k1.hdfs.rollInterval=3600
a1.sinks.k1.hdfs.fileType=DataStream

a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

运行agent后，发现在hdfs按照日期生成了目录，文件按天存放在目录。

（2）host interceptor

配置后，会在event的headers里展示运行agent的主机ip或名字。

a1.sources=s1
a1.channels=c1
a1.sinks=k1

a1.sources.s1.type=netcat
a1.sources.s1.bind=0.0.0.0
a1.sources.s1.port=8090
a1.sources.s1.interceptors=i1 i2
a1.sources.s1.interceptors.i1.type=timestamp
# 配置host interceptor
a1.sources.s1.interceptors.i2.type=host
# 如果想显示主机名，设置false
# a1.sources.s1.interceptors.i2.useIP=false

a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

a1.sinks.k1.type=logger

a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

启动agent后，发送tcp数据包，发现event消息头里有ip。

（3）static interceptor

如果要自定添加一个key-value到消息头，就可以使用它。

a1.sources=s1
a1.channels=c1
a1.sinks=k1

a1.sources.s1.type=netcat
a1.sources.s1.bind=0.0.0.0
a1.sources.s1.port=8090
a1.sources.s1.interceptors=i1 i2 i3
a1.sources.s1.interceptors.i1.type=timestamp
a1.sources.s1.interceptors.i2.type=host
# 配置static interceptor,可以配置多个，再添加个i4即可
a1.sources.s1.interceptors.i3.type=static
a1.sources.s1.interceptors.i3.key=name
a1.sources.s1.interceptors.i3.value=messi

a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

a1.sinks.k1.type=logger

a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

启动agent后，发送tcp数据包，发现event消息头里有自定义的key-value。

（4）UUID interceptor

这个就是让消息头里添加一个全局唯一标识的id。

a1.sources=s1
a1.channels=c1
a1.sinks=k1

a1.sources.s1.type=netcat
a1.sources.s1.bind=0.0.0.0
a1.sources.s1.port=8090
a1.sources.s1.interceptors=i1 i2 i3 i4
a1.sources.s1.interceptors.i1.type=timestamp
a1.sources.s1.interceptors.i2.type=host
a1.sources.s1.interceptors.i3.type=static
a1.sources.s1.interceptors.i3.key=name
a1.sources.s1.interceptors.i3.value=clyang
# 添加UUID interceptor,有$Builder，说明这使用了内部静态类
a1.sources.s1.interceptors.i4.type=org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder

a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

a1.sinks.k1.type=logger

a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

启动agent后，发送tcp数据包，发现event消息头里有UUID。

（5）search and replace interceptor

如果需要替换和搜索某些字符串，可以使用这个拦截器，需要配合正则表达式。

a1.sources=s1
a1.channels=c1
a1.sinks=k1

a1.sources.s1.type=netcat
a1.sources.s1.bind=0.0.0.0
a1.sources.s1.port=8090
a1.sources.s1.interceptors=i1 i2 i3 i4 i5
a1.sources.s1.interceptors.i1.type=timestamp
a1.sources.s1.interceptors.i2.type=host
a1.sources.s1.interceptors.i3.type=static
a1.sources.s1.interceptors.i3.key=name
a1.sources.s1.interceptors.i3.value=clyang
a1.sources.s1.interceptors.i4.type=org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
# 搜索和替换拦截器
a1.sources.s1.interceptors.i5.type=search_replace
# 正则
a1.sources.s1.interceptors.i5.searchPattern=[0-9]
# 匹配后替换后的字符
a1.sources.s1.interceptors.i5.replaceString=XJ

a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

a1.sinks.k1.type=logger

a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

另开一个窗口，发送一条tcp数据，里面的数字会被替换掉。

[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# nc hadoop01 8090
20 i miss you
OK

控制台的20已结替换成XJ了。

（6）regex filter interceptor

它可以解析事件体，根据正则表达式来筛选。

a1.sources=s1
a1.channels=c1
a1.sinks=k1

a1.sources.s1.type=netcat
a1.sources.s1.bind=0.0.0.0
a1.sources.s1.port=8090
# 配置regex filter interceptor
a1.sources.s1.interceptors=i1
a1.sources.s1.interceptors.i1.type=regex_filter
# 写正则表达式
a1.sources.s1.interceptors.i1.regex=[A-Z]
# 将匹配的event排除，不采集
a1.sources.s1.interceptors.i1.excludeEvents=true

a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

a1.sinks.k1.type=logger

a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

另开一个窗口，发送一条tcp数据，如果消息全是大写字母，agent控制台将不打印日志，小写字母将采集。

# 发送两条，一条全大写，一条全小写
[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# nc hadoop01 8090
REGEX INTERCEPTOR
OK
regex interceptor
OK

agent控制台接收结果，大写的没有接收到，小写的打印到了控制台。

sink processor

sink processor是用于处理sink组的，因为单个sink存在宕机的可能，为了实现故障转移，将需要配置sink processor。另外sink processor还可以实现sink的负载均衡。

（1）default sink processor

它只能处理一个sink，无需配置sink组，按照普通的source→channel→sink来配置就可以。

（2）failover sink processor

它可以实现故障转移，使用它需要配置一个sink组。failover sink processor内部维护着一个具有优先级的sink列表，可以保证可用的event被有效处理。当一个sink处理event失败，将降级到pool中，并且设定了CD时间，需CD时间达到才能再次使用，一旦再次使用时成功处理了一条event，它将升级到live pool。

另外，可以在配置文件中为sink设置优先级，用数字来表示，数字越大优先级越高，注意数字不能重复。如果某个sink处理event失败，sink组里剩下的选取一条优先级最高的继续处理，如果sink组里的sink没有指定优先级，则默认按照配置的顺序来选择。

下面简单的设置一个，基于扇出，因为扇出需要配置多个sink，将这些sink配置放到一个sink group并配置sink优先级，测试发送数据哪个agent将接收到，且一个sink关闭能否切换到另外一个sink来处理数据，示意图如下，蓝色框代表一个agent，绿色框代表一个sink group。

agent1配置在hadoop01，另外两个agent分别配置在hadoop02和hadoop03。

hadoop01上配置。

# 扇出流动
a1.sources=s1
a1.channels=c1 c2
a1.sinks=k1 k2

# 指定sinkgroup组的名字
a1.sinkgroups=g1
# 组下有两个sink
a1.sinkgroups.g1.sinks=k1 k2
# 组的处理器类型为failover
a1.sinkgroups.g1.processor.type=failover
# 配置sink的优先级，k1对应的优先级要高
a1.sinkgroups.g1.processor.priority.k1=8
a1.sinkgroups.g1.processor.priority.k2=2
a1.sinkgroups.g1.processor.maxpenalty=5000

# 配置source
a1.sources.s1.type=netcat
a1.sources.s1.port=8888
a1.sources.s1.bind=node01

# channel1 为memory
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100
# channel2 为file
a1.channels.c2.type=file
a1.channels.c2.dataDirs=/home/flumedata

# sink1 到hadoop02
a1.sinks.k1.type=avro
a1.sinks.k1.hostname=hadoop02
a1.sinks.k1.port=8888
# sink2 到hadoop03
a1.sinks.k2.type=avro
a1.sinks.k2.hostname=hadoop03
a1.sinks.k2.port=8888

# 连接source、channel和sink
a1.sources.s1.channels=c1 c2
a1.sinks.k1.channel=c1
a1.sinks.k2.channel=c2

hadoop02和hadoop03的配置和上面扇出的配置一样。

分别启动hadoop03、hadoop02和hadoop01，并且在hadoop01不段发送tcp数据包，查看hadoop02和hadoop03的日志输出情况。

[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# nc hadoop01 8888
# 连续发送3条
this from 01
OK
this from 01
OK
this from 01
OK
# hadoop02关闭agent后再次发送
tis from 01
OK

刚开始连续发送3条，发现只有hadoop02的agent有日志输出，hadoop03没有，因为hadoop02上的sink优先级更高。

关闭hadoop02的agent后，hadoop01上的agent会报错sink k1不可用，进行故障转移到可用的sink，即k2，而k2在hadoop03节点。

查看hadoop03，发现故障转移前和后的所有消息都接收到了，一共四条，即实现了故障转移。

（3）load balancing sink processor

它可以实现sink组内活动sink的负载均衡，有轮询（round-robin）和随机（random）两种策略。基于上面扇出的例子，只需修改hadoop01上配置文件就可以实现测试，以轮询为例。

a1.sources=s1
a1.channels=c1
a1.sinks=k1 k2

# 指定sinkgroup组的名字
a1.sinkgroups=g1
a1.sinkgroups.g1.sinks=k1 k2
# 配置为load_balance
a1.sinkgroups.g1.processor.type=load_balance
#  轮询策略
a1.sinkgroups.g1.processor.selector=round_robin

a1.sources.s1.type=netcat
a1.sources.s1.port=8888
a1.sources.s1.bind=0.0.0.0

# c1
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

# k1
a1.sinks.k1.type=avro
a1.sinks.k1.hostname=hadoop02
a1.sinks.k1.port=8888
# k2
a1.sinks.k2.type=avro
a1.sinks.k2.hostname=hadoop03
a1.sinks.k2.port=8888

# 连接source、channel和sink
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1
a1.sinks.k2.channel=c2

也同样启动三台agent后，hadoop01节点上不断发送tcp数据，发现hadoop02和hadoop03上轮流收到信息，测试ok。

hadoop02接受部分信息。

hadoop03接受部分信息。

自定义source

可以自定义一个source，模仿上文的sequence generator source。

（1）代码部分。

package com.boe;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDrivenSource;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * 自定义一个source，用来实现仿造sequenceSource
 * 可以实现EventDrivenSource或者PollableSource
 * EventDrivenSource 需要自己提供线程将数据放入到channel,可控
 * PollableSource PollableSource提供线程将数据放入到channel，不可控
 * Configurable 用于获取配置文件中的指定属性
 */
public class AuthSource extends AbstractSource implements EventDrivenSource,Configurable{
    private int step=1;
    private ExecutorService es;

    //获取配置文件中指定步长参数step
    @Override
    public void configure(Context context) {
        //属性名为步长
        String s = context.getString("step");
        if(s!=null){
           //判断step是否是正数
            int i = Integer.parseInt(s);
            if(i<=0){
               throw new IllegalArgumentException();
            }else{
               step=i;
            }
        }
    }

    @Override
    public synchronized void start() {
        //启动5个线程
        es= Executors.newFixedThreadPool(5);
        //获取到一个channel
        ChannelProcessor cp=super.getChannelProcessor();
        //提交任务完成自增过程
        es.submit(new Add(step,cp));
    }

    @Override
    public synchronized void stop() {
        es.shutdown();
    }
}

class Add implements Runnable{
    private int step;
    private ChannelProcessor cp;

    public Add(int step,ChannelProcessor cp){
        this.step=step;
        this.cp=cp;
    }

    @Override
    public void run() {
        int i=0;
        while(true){
            i+=step;
            //将数据给到channel
            Map<String,String> map=new HashMap<>();
            map.put("time",System.currentTimeMillis()+"");
            //第一个参数为body，第二个为headers
            Event event= EventBuilder.withBody((i + "").getBytes(),map);
            cp.processEvent(event);
        }
    }
}

View Code

（2）打包。

在IDEA中将其打包，将其添加到flume的lib目录下。如果是maven项目直接package打包即可，普通项目需要先添加一个artifact，然后build artifact后打包。注意最好删除extract，否则jar包扔进去，启动agent会报错。

打包后，在IDEA下使用terminal，将文件远程拷贝到flume的lib目录下即可。

（3）写配置文件，测试。

自定义source后，需要在配置文件中指定类的全路径名。

# 配置agent
a1.sources=s1
a1.channels=c1
a1.sinks=k1

# 配置自定义source
a1.sources.s1.type=com.boe.AuthSource
# 步长为10
a1.sources.s1.step=10

# 配置channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

# 配置sink
a1.sinks.k1.type=logger

# 连接source、channel和sink
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

配置好后，先启动flume agent，加载上面的文件。

[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# ../bin/flume-ng agent -n a1 -c ../conf/ -f ./auto.conf -Dflume.root.logger=info,console

agent控制台可以看到按照步长10来打印了。

自定义interceptor

ETL数据清洗中，为满足个性化的拦截需求，这个时候可以自定义拦截器，同样将jar包扔到flume的lib目录下生效。下面实现一个自定义拦截器，实现根据日志内容关键字清洗数据，然后保存到hdfs的功能。

（1）需要自定义拦截器，实现Interceptor接口。

package com.boe;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

/**
 * 自定义拦截器，实现根据关键字清洗数据
 */
public class ETLInterceptor implements Interceptor{
    //可以不需要写
    @Override
    public void initialize() {

    }

    //拦截数据方法
    @Override
    public Event intercept(Event event) {

        //获取event中body数据
        byte[] body = event.getBody();
        String log = new String(body, Charset.forName("UTF-8"));

        //body中包含以下字符，过滤掉
        if(!log.contains("ass")&&!log.contains("fuck")&&!log.contains("bitch")){
            return event;
        }
        return null;
    }

    //如果event是一个列表的情况，这个方法需要调用上面的方法
    @Override
    public List<Event> intercept(List<Event> list) {
        ArrayList<Event> events=new ArrayList<>();
        for (Event event:list){
            //调用上面方法判断
            Event event1=intercept(event);
            if(event1!=null){
                events.add(event1);
            }
        }
        return events;
    }

    //这个可以不写
    @Override
    public void close() {

    }

    //大类里面创建一个静态类
    public static class Builder implements Interceptor.Builder{

        //重写这个方法即可
        @Override
        public Interceptor build() {
            return new ETLInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}

写完代码后，打成jar包扔到flume的lib目录下。

（2）写一个配置文件，引入自定义拦截器。

# agent
a1.sources=s1
a1.channels=c1
a1.sinks=k1

# source
a1.sources.s1.type=netcat
a1.sources.s1.port=8848
a1.sources.s1.bind=hadoop01
# 使用自定义拦截器
a1.sources.s1.interceptors=i1
a1.sources.s1.interceptors.i1.type=com.boe.ETLInterceptor$Builder

# channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=100

# sink
a1.sinks.k1.type=hdfs
# 目录是myinterceptor
a1.sinks.k1.hdfs.path=hdfs://hadoop01:9000/myinterceptor
a1.sinks.k1.hdfs.rollInterval=3600
a1.sinks.k1.hdfs.fileType=DataStream

# 连接source、channel和sink
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1

（3）启动agent，测试发送tcp数据包，可以看出，需要过滤的关键字信息被过滤掉了，只展示正常的消息。

# 注意里面有敏感字，测试能否被interceptor清洗掉
[root@hadoop01 /home/software/apache-flume-1.6.0-bin/data]# nc hadoop01 8848
messi is best football player
OK
ronald?
OK
fuck
OK
ass
OK
sorry
OK

清洗成功，hdfs只写入了正常的信息。

以上，就是flume的使用入门，记录一下，后面查看用。

参考博文：

（1）flume官方文档

码农公寓

基本概念

Agent

Event

Flume复杂流动

多级流动

扇入流

扇出流

各大组件配置使用

source

channel

sink

souce selector

souce interceptor

sink processor

自定义source

自定义interceptor

相关文章