Big Data Learning Tutorial, SD Edition, Part 9: Flume

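Flume is a log collection tool. Since it is a tool, the focus here is on how to use it.

It is a distributed streaming framework for collecting, processing, and aggregating data.

Data is collected by writing a collection plan, that is, a configuration file. Reference configurations are in the official documentation.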

1. Flume Architecture

  • Agent: a JVM process
  1. Source: receives data
  2. Channel: buffer
  3. Sink: outputs data
  • Event: the unit of data transfer
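
To make the roles above concrete, here is a minimal sketch in plain Java that does not use the Flume API. The names AgentSketch, SketchEvent, sourceReceive, and sinkDrainOne are hypothetical and exist only for illustration. The channel is modeled as a bounded queue, which is an assumption about how a memory channel behaves in spirit, not Flume's actual implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical single-agent pipeline: source -> channel (bounded buffer) -> sink.
class AgentSketch {

    // Hypothetical stand-in for an Event: headers plus a byte-array body.
    static class SketchEvent {
        final Map<String, String> headers = new HashMap<>();
        final byte[] body;

        SketchEvent(String text) {
            this.body = text.getBytes(StandardCharsets.UTF_8);
        }

        String bodyAsText() {
            return new String(body, StandardCharsets.UTF_8);
        }
    }

    // The channel is modeled as a bounded queue (compare with "capacity").
    private final BlockingQueue<SketchEvent> channel;
    private final StringBuilder sinkOutput = new StringBuilder();

    AgentSketch(int capacity) {
        this.channel = new ArrayBlockingQueue<>(capacity);
    }

    // Source side: wrap incoming text in an event; returns false if the channel is full.
    boolean sourceReceive(String line) {
        return channel.offer(new SketchEvent(line));
    }

    // Sink side: take one event from the channel and output it; returns false if empty.
    boolean sinkDrainOne() {
        SketchEvent e = channel.poll();
        if (e == null) {
            return false;
        }
        sinkOutput.append(e.bodyAsText()).append('\n');
        return true;
    }

    String output() {
        return sinkOutput.toString();
    }

    public static void main(String[] args) {
        AgentSketch agent = new AgentSketch(2);
        agent.sourceReceive("hello");
        agent.sourceReceive("flume");
        // The channel holds only 2 events, so a third offer is rejected.
        System.out.println(agent.sourceReceive("overflow")); // prints "false"
        while (agent.sinkDrainOne()) {
            // keep draining until the channel is empty
        }
        System.out.print(agent.output());
    }
}
```

The point of the sketch is only that the channel decouples the two sides: the source fails fast when the buffer is full, and the sink drains at its own pace.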

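2. Flume Installation

Configure the Java and Hadoop environment variables beforehand. After that, Flume is ready to use as soon as it is unpacked.

3. Flume Official Example

The official documentation has example configurations for the different sources, channels, and sinks.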

# example.conf : port -> console
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start command

bin/flume-ng agent -c conf -f jobs/example.conf -n a1 -Dflume.root.logger=INFO,console

Send data

# yum install -y nc
nc localhost 44444

4. Flume Examples

4.1 File New Content -> HDFS

Collects content appended to a file into HDFS. It cannot resume from where it left off after a restart.

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /data/test.log

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y%m%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 24
a1.sinks.k1.hdfs.roundUnit = hour
a1.sinks.k1.hdfs.fileType = DataStream

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start

bin/flume-ng agent -c conf -f jobs/log2hdfs.conf -n a1

4.2 Dir New File -> HDFS

Collects new files in a directory into HDFS. It cannot monitor changes to the contents of existing files.

a1.sources = src-1
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = c1
a1.sources.src-1.spoolDir = /data/data1
a1.sources.src-1.fileHeader = true

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.fileType = DataStream

Start

bin/flume-ng agent -c conf -f jobs/file2hdfs.conf -n a1
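
The round, roundValue, and roundUnit settings used in the HDFS sink configs above round the event timestamp down before it is substituted into the path escapes. The sketch below shows that arithmetic in plain Java for the minute case. RoundDownSketch, roundDownMinutes, and bucketPath are hypothetical names, and the path pattern with a minute component is an assumption added for illustration: the configs above only use the hour in the path, so a 10-minute rounding would not be visible in their directory names.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration of rounding a timestamp down to a multiple of N minutes.
class RoundDownSketch {

    // Round down to the nearest lower multiple of `minutes` within the hour,
    // and zero out seconds and nanoseconds.
    static LocalDateTime roundDownMinutes(LocalDateTime ts, int minutes) {
        if (minutes <= 0 || minutes > 60) {
            throw new IllegalArgumentException("minutes must be in 1..60");
        }
        int rounded = (ts.getMinute() / minutes) * minutes;
        return ts.withMinute(rounded).withSecond(0).withNano(0);
    }

    // Assumed path layout with a minute component, so the rounding is visible.
    static String bucketPath(LocalDateTime ts) {
        return ts.format(DateTimeFormatter.ofPattern("'/flume/events/'yyyy-MM-dd/HH/mm"));
    }

    public static void main(String[] args) {
        LocalDateTime ts = LocalDateTime.of(2024, 1, 1, 12, 37, 45);
        LocalDateTime rounded = roundDownMinutes(ts, 10);
        System.out.println(bucketPath(rounded)); // prints "/flume/events/2024-01-01/12/30"
    }
}
```

With roundValue = 10 and roundUnit = minute, all events from 12:30:00 to 12:39:59 would land in the same bucket under this assumed layout.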

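4.3 Dir New File And Content -> HDFS

Monitors files under multiple directories, including changes to their contents, and ships them to HDFS. It can resume from the last recorded position. Note that log4j renames log files when it rolls them, and a renamed file is uploaded again.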

a1.sources = r1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /data/data2/.*file.*
a1.sources.r1.filegroups.f2 = /data/data3/.*log.*
a1.sources.r1.maxBatchCount = 1000

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events2/%Y-%m-%d/%H
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.fileType = DataStream

Start

bin/flume-ng agent -c conf -f jobs/dir2hdfs.conf -n a1

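Sample position file content: [{"inode":786450,"pos":1501,"file":"/data/data2/file1.txt"}]. The source code identifies a file by its inode and file path together.

To handle the file rename problem, modify TailFile.java (line 123) and ReliableTaildirEventReader.java (line 256), rebuild, and replace the taildir source jar in the lib directory.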
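
The effect of matching on inode and path together can be modeled in a few lines of plain Java. This is not Flume's source code: TaildirPositionSketch and its methods are hypothetical, and the inode-only lookup is just one assumed direction for the patch mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a position store keyed by inode.
class TaildirPositionSketch {

    // One recorded position: byte offset plus the path it was recorded under.
    static class Position {
        final long pos;
        final String file;

        Position(long pos, String file) {
            this.pos = pos;
            this.file = file;
        }
    }

    private final Map<Long, Position> byInode = new HashMap<>();

    void record(long inode, String file, long pos) {
        byInode.put(inode, new Position(pos, file));
    }

    // Strict lookup: inode and path must both match, otherwise read from offset 0.
    long resumeOffsetStrict(long inode, String file) {
        Position p = byInode.get(inode);
        return (p != null && p.file.equals(file)) ? p.pos : 0L;
    }

    // Relaxed lookup (assumed patch direction): the inode alone identifies the file,
    // so a renamed file keeps its offset.
    long resumeOffsetByInode(long inode) {
        Position p = byInode.get(inode);
        return p != null ? p.pos : 0L;
    }

    public static void main(String[] args) {
        TaildirPositionSketch store = new TaildirPositionSketch();
        store.record(786450L, "/data/data2/file1.txt", 1501L);

        // Same inode, same path: resume where we left off.
        System.out.println(store.resumeOffsetStrict(786450L, "/data/data2/file1.txt")); // prints "1501"
        // Same inode, renamed path: the strict lookup starts over, so data is re-uploaded.
        System.out.println(store.resumeOffsetStrict(786450L, "/data/data2/file1.txt.1")); // prints "0"
        // Inode-only lookup keeps the offset across the rename.
        System.out.println(store.resumeOffsetByInode(786450L)); // prints "1501"
    }
}
```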

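5. Flume Transactions

The Source pushes events to the Channel and the Sink pulls events from the Channel. In both directions, events first pass through a temporary buffer.

  1. Source -> Channel: doPut stages events in putList. A rollback simply discards the staged events, so data can be lost unless the source keeps a position record.

  2. Channel -> Sink: doTake stages events in takeList. A rollback writes the taken events back to the Channel queue, so data can be duplicated.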
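
The two rollback behaviors can be sketched with plain Java collections. This is a simplified, single-threaded model and not Flume's channel code; ChannelTxSketch and all of its method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of put/take staging buffers around a channel queue.
class ChannelTxSketch {
    private final Deque<String> queue = new ArrayDeque<>();    // committed events
    private final List<String> putList = new ArrayList<>();    // staged by the source
    private final Deque<String> takeList = new ArrayDeque<>(); // staged by the sink

    // Source side: stage an event; it is not visible to the sink yet.
    void put(String event) {
        putList.add(event);
    }

    // Commit moves staged events into the channel queue.
    void commitPut() {
        queue.addAll(putList);
        putList.clear();
    }

    // Rollback drops staged events; they are lost unless the source can re-read them.
    void rollbackPut() {
        putList.clear();
    }

    // Sink side: remove the head of the queue and remember it in takeList.
    String take() {
        String event = queue.pollFirst();
        if (event != null) {
            takeList.addLast(event);
        }
        return event;
    }

    // Commit forgets the taken events for good.
    void commitTake() {
        takeList.clear();
    }

    // Rollback pushes taken events back to the head of the queue in original order.
    void rollbackTake() {
        while (!takeList.isEmpty()) {
            queue.addFirst(takeList.pollLast());
        }
    }

    List<String> snapshot() {
        return new ArrayList<>(queue);
    }

    public static void main(String[] args) {
        ChannelTxSketch ch = new ChannelTxSketch();
        ch.put("a");
        ch.put("b");
        ch.commitPut();
        ch.put("c");
        ch.rollbackPut();                  // "c" never reaches the queue
        ch.take();
        ch.take();
        ch.rollbackTake();                 // "a" and "b" return to the queue
        System.out.println(ch.snapshot()); // prints "[a, b]"
    }
}
```

If the sink had already written "a" to its destination before the rollback, the next take would deliver "a" again, which is where duplicates come from.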

6. Flume Agent Internals

  1. Source: receives data
  2. Source -> Channel Processor: processes events
  3. Channel Processor -> Interceptor: intercepts and filters events
  4. Channel Processor -> Channel Selector: replicating by default; multiplexing is also available
  5. Channel Processor -> Channel n: events are written to the selected channel(s)
  6. Channel -> Sink Processor: three kinds: Default (a single Sink), LoadBalancing, and Failover
  7. Sink Processor -> Sink: the Sink writes the events out
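
Step 4 is the part that decides where an event goes, so here is a minimal sketch of the two selector strategies in plain Java. SelectorSketch and its method signatures are hypothetical, and the defaultChannels fallback is an assumption about what should happen when a header value has no mapping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the two channel selector strategies.
class SelectorSketch {

    // Replicating: every event is copied to all configured channels.
    static List<String> replicating(List<String> allChannels) {
        return new ArrayList<>(allChannels);
    }

    // Multiplexing: the value of one header picks the channels via a mapping.
    // Events with a missing or unmapped header value go to defaultChannels (assumed).
    static List<String> multiplexing(Map<String, String> headers,
                                     String headerName,
                                     Map<String, List<String>> mapping,
                                     List<String> defaultChannels) {
        String value = headers.get(headerName);
        List<String> target = mapping.get(value);
        return target != null ? target : defaultChannels;
    }

    public static void main(String[] args) {
        Map<String, List<String>> mapping = new HashMap<>();
        mapping.put("error", Arrays.asList("c1"));
        mapping.put("normal", Arrays.asList("c2"));

        Map<String, String> headers = new HashMap<>();
        headers.put("type", "error");

        System.out.println(replicating(Arrays.asList("c1", "c2")));                        // prints "[c1, c2]"
        System.out.println(multiplexing(headers, "type", mapping, Arrays.asList("c2")));  // prints "[c1]"
    }
}
```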

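7. Flume Topologies

Multiple Flume agents are connected to each other through Avro.

Polling strategy: if a Sink pulls no data, switch to another Sink.

  1. Simple chaining: Sink -> Source
  2. Replicating and multiplexing: multiple Channels -> multiple Sinks
  3. Load balancing and failover: Channel -> multiple Sinks
  4. Aggregation: multiple Sinks -> Source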
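
For topology 3, the difference between failover and load balancing comes down to how the next Sink is chosen. The sketch below models both choices in plain Java. SinkProcessorSketch, failover, and RoundRobin are hypothetical names, and round robin is only one possible load-balancing policy, used here as an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of how a sink group might pick the next sink.
class SinkProcessorSketch {

    // Failover: pick the healthy sink with the highest priority value, or null if none.
    static String failover(Map<String, Integer> priorities, Set<String> failed) {
        String best = null;
        int bestPriority = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> e : priorities.entrySet()) {
            if (failed.contains(e.getKey())) {
                continue; // skip sinks that are currently marked as failed
            }
            if (best == null || e.getValue() > bestPriority) {
                best = e.getKey();
                bestPriority = e.getValue();
            }
        }
        return best;
    }

    // Load balancing (round robin): cycle through the sinks in order.
    static class RoundRobin {
        private final List<String> sinks;
        private int next = 0;

        RoundRobin(List<String> sinks) {
            this.sinks = sinks;
        }

        String pick() {
            String sink = sinks.get(next);
            next = (next + 1) % sinks.size();
            return sink;
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> priorities = new LinkedHashMap<>();
        priorities.put("k1", 10);
        priorities.put("k2", 5);

        System.out.println(failover(priorities, new HashSet<String>()));              // prints "k1"
        System.out.println(failover(priorities, new HashSet<>(Arrays.asList("k1")))); // prints "k2"

        RoundRobin rr = new RoundRobin(Arrays.asList("k1", "k2"));
        System.out.println(rr.pick() + " " + rr.pick() + " " + rr.pick());            // prints "k1 k2 k1"
    }
}
```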

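8. Flume Custom Interceptor

Use a custom Interceptor to implement multiplexing:

  1. Events enter different Channels depending on their Header value

  2. Collected messages that contain Error or Exception go to one Channel; everything else goes to another Channel

  3. The Sink of each Channel prints to the console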
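
The routing rule in point 2 can be checked in isolation before it is wired into Flume. The sketch below is plain Java with no Flume dependency; TypeHeaderSketch, classify, and tag are hypothetical names. It assumes event bodies are UTF-8 text and that the match is case-sensitive, as in a plain String.contains check.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, Flume-free version of the header tagging rule.
class TypeHeaderSketch {

    // Returns "error" if the body contains "Error" or "Exception", otherwise "normal".
    static String classify(byte[] body) {
        String text = new String(body, StandardCharsets.UTF_8);
        return (text.contains("Error") || text.contains("Exception")) ? "error" : "normal";
    }

    // Writes the classification into the "type" header and returns the same map.
    static Map<String, String> tag(Map<String, String> headers, byte[] body) {
        headers.put("type", classify(body));
        return headers;
    }

    public static void main(String[] args) {
        byte[] bad = "java.lang.NullPointerException at Foo.bar".getBytes(StandardCharsets.UTF_8);
        byte[] ok = "request served in 12 ms".getBytes(StandardCharsets.UTF_8);

        System.out.println(tag(new HashMap<String, String>(), bad)); // prints "{type=error}"
        System.out.println(tag(new HashMap<String, String>(), ok));  // prints "{type=normal}"
    }
}
```

Because the check is case-sensitive, a body that only contains lowercase "error" would be tagged as normal. Whether that is acceptable depends on the log format being collected.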

  1. Write the custom Interceptor
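package com.ipinyou.flume.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Tags each event with a "type" header so a multiplexing selector can route it.
public class TypeInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // No state to set up.
    }

    // Sets type=error if the body contains "Error" or "Exception", otherwise type=normal.
    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        // Decode with an explicit charset instead of the platform default.
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        if (body.contains("Error") || body.contains("Exception")) {
            headers.put("type", "error");
        } else {
            headers.put("type", "normal");
        }
        return event;
    }

    // Applies the single-event logic to every event in the batch.
    @Override
    public List<Event> intercept(List<Event> list) {
        // Build a fresh list per batch so no state is shared between calls.
        List<Event> result = new ArrayList<>(list.size());
        for (Event event : list) {
            result.add(intercept(event));
        }
        return result;
    }

    @Override
    public void close() {
        // Nothing to release.
    }

    // Referenced in the config as TypeInterceptor$Builder.
    public static class Builder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new TypeInterceptor();
        }

        @Override
        public void configure(Context context) {
            // No configurable properties.
        }
    }
}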
  2. Package it and upload the jar to Flume's lib directory
  3. Write the collection configs

flume-s1-s2.conf

a1.sources = r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1 c2
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.ipinyou.flume.interceptor.TypeInterceptor$Builder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.error = c1
a1.sources.r1.selector.mapping.normal = c2

a1.channels = c1 c2
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 10000
a1.channels.c2.byteCapacityBufferPercentage = 20
a1.channels.c2.byteCapacity = 800000


a1.sinks = k1 k2
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 7771
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 7772

flume-console1.conf

a1.sources = r1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = hadoop103
a1.sources.r1.port = 7771

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

flume-console2.conf

a1.sources = r1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = hadoop104
a1.sources.r1.port = 7772

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

Start

# Start in this order: hadoop103, hadoop104, hadoop102
bin/flume-ng agent -c conf -f jobs/flume-console1.conf -n a1 -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf -f jobs/flume-console2.conf -n a1 -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf -f jobs/flume-s1-s2.conf -n a1

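9. Flume Custom Source

  • Implementation
  1. Define a class that extends AbstractSource and implements Configurable and PollableSource
  2. Implement configure(): read the configuration
  3. Implement process(): receive external data, wrap it in an Event, and write it to the Channel
  • Package it into lib
  • Write the configuration file

source type: the fully qualified class name

  • Start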

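10. Flume Custom Sink

  • Implementation
  1. Define a class that extends AbstractSink and implements Configurable

  2. Implement configure(): read the configuration

  3. Implement process(): open a transaction, take events from the Channel, and write them to the destination

  • The remaining steps are the same as for the custom Source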

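11. Flume Monitoring

Monitoring relies on Ganglia, a third-party open-source tool.

Ganglia components: web displays the data, gmetad stores the data, gmond collects the data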

11.1 Ganglia Installation

  1. Install
# 102 103 104
yum install -y epel-release
# 102
yum install -y ganglia-gmetad
yum install -y ganglia-web
yum install -y ganglia-gmond
# 103 104
yum install -y ganglia-gmond
  2. Modify the configuration files

/etc/httpd/conf.d/ganglia.conf

# Under Location, allow the IP of the Windows host
Require ip 192.168.xxx.xxx

/etc/ganglia/gmetad.conf

data_source "my cluster" hadoop102

/etc/ganglia/gmond.conf: distribute to hadoop102, 103, and 104

# Change the following settings
name = "my cluster"
host = hadoop102
bind = 0.0.0.0

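Disable SELinux in /etc/selinux/config. The change takes effect after a reboot; use setenforce to disable it temporarily in the meantime.

SELINUX=disabled
# Takes effect temporarily
setenforce 0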

11.2 Ganglia Startup

# If permissions are insufficient, change them
chmod -R 777 /var/lib/ganglia
# hadoop102
systemctl start gmond
systemctl start httpd
systemctl start gmetad

# hadoop103 hadoop104
systemctl start gmond

Open the Web UI in a browser:

http://hadoop102/ganglia

11.3 Flume Startup

bin/flume-ng agent -n a1 -c conf -f jobs/xxx \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=ganglia \
-Dflume.monitoring.hosts=hadoop102:8649