RDD(Resilient Distributed Dataset)
Spark source code: https://github.com/apache/spark
abstract class RDD[T: ClassTag](
@transient private var _sc: SparkContext,
@transient private var deps: Seq[Dependency[_]]
) extends Serializable with Logging
1. RDD is an abstract class: it cannot be used directly; it only becomes usable once a subclass implements its abstract methods (see the subclass sketch at the end of this section)
2. It is generic, so an RDD can hold elements of many types: String, Person, User (see the sketch just below)
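A quick sketch of point 2 in a spark-shell session (sc is the shell's predefined SparkContext; Person is a hypothetical case class, not a Spark type): the element type T is fixed by the data you hand in.

import org.apache.spark.rdd.RDD

case class Person(name: String, age: Int)  // hypothetical element type

val strings: RDD[String] = sc.parallelize(Seq("a", "b", "c"))
val people: RDD[Person] = sc.parallelize(Seq(Person("Tom", 20), Person("Ann", 30)))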
RDD: Resilient Distributed Dataset
The scaladoc describes it as an immutable, partitioned collection of elements that can be operated on in parallel.
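Those three words map one-to-one onto a spark-shell session (a minimal sketch, assuming local mode with the predefined sc):

val rdd = sc.parallelize(1 to 10, 4)  // partitioned: data is split into 4 partitions
rdd.getNumPartitions                  // res: Int = 4
val doubled = rdd.map(_ * 2)          // immutable: map returns a new RDD, rdd itself is untouched
doubled.collect()                     // parallel: each partition is computed by its own task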
Internally, each RDD is characterized by five main properties (quoted from the RDD scaladoc):
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs (the lineage chain, e.g. rdd1 => rdd2 => rdd3)
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned); see the sketch after this list
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file): Spark preferentially schedules a task onto the node that already holds its data, because moving computation is cheaper than moving data
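Properties 4 and 5 can be observed interactively (a sketch, assuming spark-shell in local mode; the partitioner's arity depends on spark.default.parallelism):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
pairs.partitioner                       // None: a freshly parallelized RDD has no partitioner
val reduced = pairs.reduceByKey(_ + _)  // the shuffle installs a HashPartitioner
reduced.partitioner                     // Some(org.apache.spark.HashPartitioner@...)
// property 5: empty for in-memory data; an HDFS-backed RDD would report block locations here
reduced.partitions.map(p => reduced.preferredLocations(p))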
How the five properties are reflected in the RDD source code:
def compute(split: Partition, context: TaskContext): Iterator[T]         // property 2
protected def getPartitions: Array[Partition]                            // property 1
protected def getDependencies: Seq[Dependency[_]] = deps                 // property 3
protected def getPreferredLocations(split: Partition): Seq[String] = Nil // property 5
val partitioner: Option[Partitioner] = None                              // property 4
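Tying this back to point 1: a subclass only has to fill in compute and getPartitions; the other three properties keep the defaults shown above. A minimal sketch (ArrayRDD and SlicePartition are hypothetical names, not Spark classes), roughly what sc.parallelize does internally:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// hypothetical Partition type: each instance carries one slice of the source array
class SlicePartition(override val index: Int, val data: Array[Int]) extends Partition

class ArrayRDD(sc: SparkContext, values: Array[Int], numSlices: Int)
  extends RDD[Int](sc, Nil) {  // Nil parents: getDependencies (property 3) stays empty

  // property 1: a list of partitions
  override protected def getPartitions: Array[Partition] = {
    val size = math.max(1, math.ceil(values.length.toDouble / numSlices).toInt)
    values.grouped(size).zipWithIndex
      .map { case (slice, i) => new SlicePartition(i, slice) }
      .toArray
  }

  // property 2: a function for computing each split
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    split.asInstanceOf[SlicePartition].data.iterator
}

Used as new ArrayRDD(sc, (1 to 10).toArray, 3).collect(), it behaves like a tiny sc.parallelize.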