MR Example: Merging Small Files into a SequenceFile

2022-07-10 07:59:50

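SequenceFile is a binary file format supported by the Hadoop API. It serializes <key, value> pairs directly into a file. It can be used to merge small files: each file name becomes a key and the file's content becomes the value, and all pairs are serialized into one large file. This format has the following advantages: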

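1). Compression support: compression can be configured per Record or per Block (Block-level compression performs better).
2). Data locality for tasks: because the file can be split, MapReduce tasks should get very good data locality.
3). Low difficulty: because the API is provided by the Hadoop framework, the changes needed on the business-logic side are fairly simple.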

Drawbacks: an extra step is needed to merge the files, and the merged file is not convenient to inspect.
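To make the "file name as key, file content as value" idea concrete, here is a minimal sketch in plain Java that does not use Hadoop. It packs several small files into one byte array of length-prefixed records and reads them back. The class name KeyValuePackSketch and the record layout are assumptions made for illustration only; this is not the real SequenceFile on-disk format, which also has a header, sync markers, and optional compression.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: NOT the actual SequenceFile layout.
public class KeyValuePackSketch {

    /** Packs each entry as: key length, key bytes (UTF-8), value length, value bytes. */
    public static byte[] pack(Map<String, byte[]> files) {
        int size = 0;
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            size += 4 + e.getKey().getBytes(StandardCharsets.UTF_8).length
                  + 4 + e.getValue().length;
        }
        ByteBuffer out = ByteBuffer.allocate(size);
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            byte[] key = e.getKey().getBytes(StandardCharsets.UTF_8);
            out.putInt(key.length).put(key);
            out.putInt(e.getValue().length).put(e.getValue());
        }
        return out.array();
    }

    /** Reads the records back in the order they were written. */
    public static Map<String, byte[]> unpack(byte[] packed) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        ByteBuffer in = ByteBuffer.wrap(packed);
        while (in.hasRemaining()) {
            byte[] key = new byte[in.getInt()];
            in.get(key);
            byte[] value = new byte[in.getInt()];
            in.get(value);
            files.put(new String(key, StandardCharsets.UTF_8), value);
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, byte[]> small = new LinkedHashMap<>();
        small.put("a.txt", "hello".getBytes(StandardCharsets.UTF_8));
        small.put("b.txt", "world".getBytes(StandardCharsets.UTF_8));

        byte[] merged = pack(small);
        for (Map.Entry<String, byte[]> e : unpack(merged).entrySet()) {
            System.out.println(e.getKey() + " -> "
                    + new String(e.getValue(), StandardCharsets.UTF_8));
        }
    }
}
```

The sketch also shows why the merged file is awkward to inspect: the content only makes sense to a reader that knows the record layout.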

package test0820;

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TestSF {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://10.16.17.182:9000"), conf);

        // Input path: a directory containing the small files
        FileStatus[] files = fs.listStatus(new Path(args[0]));

        Text key = new Text();
        Text value = new Text();

        // Output path: the merged SequenceFile
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(args[1]), key.getClass(), value.getClass());
        try {
            for (FileStatus file : files) {
                // Key: the file name
                key.set(file.getPath().getName());

                // Value: the whole file content, read into memory.
                // Text assumes UTF-8 text, and each file must fit in an int-sized buffer.
                byte[] buffer = new byte[(int) file.getLen()];
                InputStream in = fs.open(file.getPath());
                try {
                    IOUtils.readFully(in, buffer, 0, buffer.length);
                } finally {
                    // Close the input stream even if the read fails
                    IOUtils.closeStream(in);
                }
                value.set(buffer);

                System.out.println(key.toString() + "\n" + value.toString());
                writer.append(key, value);
            }
        } finally {
            // Close the writer even if a read or append fails
            IOUtils.closeStream(writer);
        }
    }
}

Note: one thing still to be improved is compressing the output in Block mode.
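To give a rough feel for why Block compression is worth adding, the sketch below compares compressing many small records one at a time with compressing them together as one block. It uses java.util.zip.Deflater from the JDK rather than Hadoop's codecs, and the class name, helper methods, and sample records are all assumptions made for illustration. In Hadoop itself the granularity is chosen through SequenceFile.CompressionType (NONE, RECORD, or BLOCK) when the writer is created.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Hypothetical illustration only: JDK Deflater, not Hadoop's compression codecs.
public class CompressionGranularitySketch {

    /** Returns the number of bytes produced by deflating the given data. */
    public static int deflatedSize(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] chunk = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(chunk);
        }
        deflater.end();
        return total;
    }

    /** Record-style: every record is compressed on its own. */
    public static int recordLevelSize(List<byte[]> records) {
        int total = 0;
        for (byte[] record : records) {
            total += deflatedSize(record);
        }
        return total;
    }

    /** Block-style: records are concatenated and compressed together. */
    public static int blockLevelSize(List<byte[]> records) {
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        for (byte[] record : records) {
            block.write(record, 0, record.length);
        }
        return deflatedSize(block.toByteArray());
    }

    /** Builds n small, similar records (made-up sample data). */
    public static List<byte[]> sampleRecords(int n) {
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String line = "file-" + i + ": status=OK, payload=small record\n";
            records.add(line.getBytes(StandardCharsets.UTF_8));
        }
        return records;
    }

    public static void main(String[] args) {
        List<byte[]> records = sampleRecords(200);
        System.out.println("record-level bytes: " + recordLevelSize(records));
        System.out.println("block-level bytes:  " + blockLevelSize(records));
    }
}
```

With small, similar records, compressing them together lets the compressor reuse redundancy across records and pay its fixed overhead once, which is the intuition behind preferring Block over Record compression for merged small files.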
