Resource downloads:
IntelliJ IDEA Community Edition 2021.2.3: https://www.jetbrains.com/idea/download/other.html
Maven 3.5.4: https://archive.apache.org/dist/maven/maven-3/
Spark 2.4.2: http://archive.apache.org/dist/spark/
Scala 2.12.2: https://www.scala-lang.org/download/all.html
Hadoop 2.6.0: https://archive.apache.org/dist/hadoop/common/
1. Create a New Project
- Choose New Project to create a new project
- Select a Maven project, then click Next
- Enter the project name and storage location, then click Finish to complete project creation
2. Configure the Maven Environment
- Under File, open Settings and search for "maven"
- Update the three paths: Maven home path, User settings file, and Local repository (make sure Override is checked), then click Apply
- After configuring Maven, open the Maven tool window on the right. If errors appear (red squiggly lines), click the refresh button and wait a few minutes; if the errors persist, the Maven configuration may be wrong and should be rechecked
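For reference, a minimal settings.xml sketch covering the Local repository path and an Aliyun mirror (matching the repository declared in the pom.xml below) might look like this; the `D:\maven\repository` path is a placeholder for your own location:

```xml
<!-- Sketch of a user settings.xml; the repository path is a placeholder -->
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
  <!-- Where downloaded dependencies are stored (Local repository in the IDE) -->
  <localRepository>D:\maven\repository</localRepository>
  <mirrors>
    <mirror>
      <id>nexus-aliyun</id>
      <mirrorOf>central</mirrorOf>
      <name>Nexus aliyun</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public</url>
    </mirror>
  </mirrors>
</settings>
```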
- Below is the pom.xml dependency configuration file, which you can copy directly
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>Spark</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<scala.version>2.12.2</scala.version>
<spark.version>2.4.2</spark.version>
<hadoop.version>2.6.0</hadoop.version>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
</properties>
<!-- Aliyun mirror repository -->
<repositories>
<repository>
<id>nexus-aliyun</id>
<name>Nexus aliyun</name>
<url>http://maven.aliyun.com/nexus/content/groups/public</url>
</repository>
</repositories>
<dependencies>
<!-- Scala library dependency -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<!-- Spark Core dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- Spark SQL dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- Hadoop client dependency -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<!-- Hadoop HDFS dependency -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>${hadoop.version}</version>
</dependency>
<!-- Logging (log4j) dependency -->
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.12</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.1.1</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass></mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
- After adding the dependency configuration, click the import icon shown in the figure to download the dependencies (all dependency artifacts are downloaded into the Local repository directory set in the Maven configuration). When the download finishes, refresh again; if no errors remain, the import succeeded
3. Configure the Scala Environment
- Scala language support in IDEA requires the Scala plugin: open Plugins in Settings, search for "scala", and install it. Restart IDEA after the installation completes
- After restarting, right-click the project and choose Add Framework Support to add framework support to the project
- Select Scala and click Create on the right. In the dialog you can either choose Download to fetch Scala online, or choose Browse… to locate a Scala distribution you have already downloaded
- Once selected, click OK to complete the Scala environment configuration
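To verify the Scala setup before moving on, a minimal sketch like the following can be created and run (the object name here is arbitrary):

```scala
// Minimal smoke test: if this compiles and prints, the Scala SDK is wired up correctly.
object ScalaCheck {
  def main(args: Array[String]): Unit = {
    // versionString reports the Scala runtime version, e.g. "version 2.12.2"
    val version = scala.util.Properties.versionString
    println(s"Scala is working: $version")
  }
}
```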
4. Test Preparation
- Create a new scala folder at the location shown in the figure, and likewise create a scala folder under test. Right-click each and use Mark Directory as to mark them as Sources Root and Test Sources Root respectively
- Once marked, right-click the scala folder and create a Scala Class file for test programming
- If the Scala Class option does not appear (as in the upper figure), the usual causes are: the Scala environment is not configured, or the scala folders have not been marked
5. Word Count Test
- Right-click and create a WordCount.scala file
package wordCount
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object WordCount {
  def main(args: Array[String]): Unit = {
    // Create the SparkContext (local mode with 6 threads)
    val conf = new SparkConf().setAppName("wordCount").setMaster("local[6]")
    val sc = new SparkContext(conf)
    // Load the input file
    val context: RDD[String] = sc.textFile("G:\\Projects\\IdealProject-C21\\projects\\src\\main\\scala\\wordCount\\test.txt")
    // Process the data: split lines into words, map to (word, 1), then sum counts per word
    val split: RDD[String] = context.flatMap(item => item.split(" "))
    val count: RDD[(String, Int)] = split.map(item => (item, 1))
    val reduce = count.reduceByKey((curr, agg) => curr + agg)
    val result = reduce.collect()
    result.foreach(println(_))
    // Stop the SparkContext to release resources
    sc.stop()
  }
}
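The same flatMap → map → reduce-by-key pipeline can be sketched on plain Scala collections with no Spark required, which is a handy way to sanity-check the logic; the sample lines below are made up for illustration:

```scala
// Word count on ordinary Scala collections, mirroring the RDD pipeline above.
object LocalWordCount {
  def main(args: Array[String]): Unit = {
    val lines = Seq("hello spark", "hello scala")   // stand-in for the text file
    val counts = lines
      .flatMap(_.split(" "))                        // flatMap: split lines into words
      .map(word => (word, 1))                       // map: pair each word with 1
      .groupBy(_._1)                                // groupBy + sum plays the role of reduceByKey
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
    counts.foreach(println)
  }
}
```

Running the Spark version against a real test.txt produces the same (word, count) pairs, just computed in parallel across partitions.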