Big Data: Hand-Written MapReduce (1)

Getting started with big data usually begins with writing a MapReduce program by hand.

MR stands for MapReduce, one of Hadoop's core components. It executes tasks in parallel and is mainly used to process data stored in the HDFS distributed file system.

This post walks through a hand-written WordCount example; the underlying principles will be explained in later posts.

1. Download Hadoop and install it locally on Windows

    Download: https://archive.apache.org/dist/hadoop/core/hadoop-2.7.2/hadoop-2.7.2.tar.gz

2. Extract the archive, then set the environment variables

     Create HADOOP_HOME = D:\hadoop-2.7.2

     Append %HADOOP_HOME%\bin and %HADOOP_HOME%\sbin to Path
 

3. Install the JDK and IntelliJ IDEA Community Edition

4. Create a Maven project

5. Add the required dependency

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.2</version>
    </dependency>
</dependencies>

6. After the Mapper, Reducer, and Driver are written, running locally may throw a NullPointerException

The cause is that the hadoop.dll and winutils.exe files are missing. To fix this:


1. Download the two files from: http://download.csdn.net/download/fly_leopard/9503059

2. After extracting, copy hadoop.dll into C:\Windows\System32

3. Create the HADOOP_HOME environment variable (as in step 2 above), then copy winutils.exe into %HADOOP_HOME%\bin
Mirror download: https://pan.baidu.com/s/1g75yEqOaZtljZrfdssDZ5w
Extraction code: andy
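On Windows, this NullPointerException typically comes from Hadoop's native shims looking for winutils.exe under %HADOOP_HOME%\bin. A small sketch to sanity-check the setup before rerunning the job; the class and method names here are my own illustration, not part of Hadoop:

```java
import java.io.File;

public class WinutilsCheck {
    // Builds the path where Hadoop expects winutils.exe,
    // i.e. <HADOOP_HOME>\bin\winutils.exe; returns null if home is unset.
    static File winutilsPath(String hadoopHome) {
        if (hadoopHome == null || hadoopHome.isEmpty()) {
            return null;
        }
        return new File(new File(hadoopHome, "bin"), "winutils.exe");
    }

    public static void main(String[] args) {
        File w = winutilsPath(System.getenv("HADOOP_HOME"));
        if (w == null) {
            System.out.println("HADOOP_HOME is not set");
        } else {
            System.out.println(w + (w.isFile() ? " -> found" : " -> MISSING"));
        }
    }
}
```

If the check prints MISSING, redo step 3 above before running the job again.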

7. After these fixes, the job runs and the word counts come out correctly

Mapper:

package cn.andy.mr;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
/**
 * @author AndyShi
 * @version 1.0
 * @date 2021/4/2 22:04
 */
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // IDEA shortcut to override parent methods: Alt+Insert
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get one line of input
        String line = value.toString();
        // Split the line on spaces
        String[] fields = line.split(" ");
        // Iterate over the words
        for (String field : fields) {
            // Emit each word with the marker 1, e.g. (java(K), 1(V))
            context.write(new Text(field), new IntWritable(1));
        }
    }
}
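Stripped of the Hadoop types, the map step above does one thing: split a line on spaces and emit a (word, 1) pair per token. A plain-Java sketch of that logic (MapSketch is an illustrative name, not a Hadoop class):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapSketch {
    // Mirrors WCMapper.map: split one input line on spaces and
    // emit a (word, 1) pair for every token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Duplicates are expected: the reducer will merge them later.
        System.out.println(map("java hadoop java"));
        // [java=1, hadoop=1, java=1]
    }
}
```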

 

Reducer:

package cn.andy.mr;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
/**
 * @author AndyShi
 * @version 1.0
 * @date 2021/4/2 22:04
 */
public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // IDEA shortcut to implement parent methods: Ctrl+O
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        // Counter for this word
        int count = 0;
        // Accumulate the counts
        for (IntWritable intWritable : values) {
            // Convert each IntWritable to int and add it
            count += intWritable.get();
        }
        // Emit (word, total)
        context.write(key, new IntWritable(count));
    }
}
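Between the two phases, the framework shuffles the map output so that all values for one key arrive together; reduce then just sums them. The same logic without Hadoop types (ReduceSketch is an illustrative name):

```java
import java.util.Arrays;
import java.util.List;

public class ReduceSketch {
    // Mirrors WCReducer.reduce: sum the grouped counts for one word.
    static int reduce(List<Integer> values) {
        int count = 0;
        for (int v : values) {
            count += v;
        }
        return count;
    }

    public static void main(String[] args) {
        // After the shuffle, a word like "java" arrives with all its 1s grouped.
        System.out.println(reduce(Arrays.asList(1, 1, 1))); // 3
    }
}
```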

 

Driver:

package cn.andy.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


import java.io.IOException;
/**
 * @author AndyShi
 * @version 1.0
 * @date 2021/4/2 22:04
 */

public class WCDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // Load the configuration
        Configuration configuration = new Configuration();
        // Create a job
        Job job = Job.getInstance(configuration);

        // Configure the job
        job.setJarByClass(WCDriver.class);
        // Set the custom Mapper class and its output key/value types
        job.setMapperClass(WCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Set the custom Reducer class and the job's final output key/value types
        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Delete the output directory if it already exists, otherwise
        // the job fails because MR refuses to overwrite existing output
        FileSystem fileSystem = FileSystem.get(configuration);
        Path outputPath = new Path("wordcount/output/20210402");
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }

        // Set the input path
        FileInputFormat.setInputPaths(job, new Path("wordcount/input/wordcount.txt"));

        // Set the output path (reuse the path checked above)
        FileOutputFormat.setOutputPath(job, outputPath);

        // Submit the job and exit with its status
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
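The pipeline the driver wires up can be simulated in memory for a quick sanity check of the expected output. This is a sketch only; the real job runs through Hadoop's runtime and writes part-r-00000 files under the output directory:

```java
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Simulates the full map -> shuffle -> reduce pipeline in memory.
    // TreeMap keeps keys sorted, matching how MR output is ordered by key.
    static Map<String, Integer> wordCount(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {                   // one map() call per line
            for (String word : line.split(" ")) {
                counts.merge(word, 1, Integer::sum);  // shuffle + reduce combined
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] input = {"hello hadoop", "hello java"};
        // Each output line corresponds to one "word<TAB>count" line
        // in part-r-00000: hadoop 1, hello 2, java 1.
        wordCount(input).forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}
```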

This hand-written local MR follows online tutorials; starting with the next post, the underlying principles will be dissected step by step.

 
