使用spark操作kudu

使用spark操作kudu

Spark与KUDU集成支持:

  • DDL操作(创建/删除)

  • 本地Kudu RDD

  • Native Kudu数据源,用于DataFrame集成

  • 从kudu读取数据

  • 从Kudu执行插入/更新/ upsert /删除

  • 谓词下推

  • Kudu和Spark SQL之间的模式映射

    到目前为止,我们已经听说过几个上下文,例如SparkContext,SQLContext,HiveContext,SparkSession,现在,我们将使用Kudu引入一个KuduContext。这是可在Spark应用程序中广播的主要可序列化对象。此类代表在Spark执行程序中与Kudu Java客户端进行交互。

    KuduContext提供执行DDL操作所需的方法,与本机Kudu RDD的接口,对数据执行更新/插入/删除,将数据类型从Kudu转换为Spark等。

    比较常见的操作:

// Create a Spark and SQL context
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc) // Comma-separated list of Kudu masters with port numbers
val master1 = "ip-10-13-4-249.ec2.internal:7051"
val master2 = "ip-10-13-5-150.ec2.internal:7051"
val master3 = "ip-10-13-5-56.ec2.internal:7051"
val kuduMasters = Seq(master1, master2, master3).mkString(",") // Create an instance of a KuduContext
val kuduContext = new KuduContext(kuduMasters)

Maven导包

 <repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories> <dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.kudu/kudu-client -->
<dependency>
<groupId>org.apache.kudu</groupId>
<artifactId>kudu-client</artifactId>
<version>1.6.0-cdh5.14.0</version>
<scope>test</scope>
</dependency> <!-- https://mvnrepository.com/artifact/org.apache.kudu/kudu-client-tools -->
<dependency>
<groupId>org.apache.kudu</groupId>
<artifactId>kudu-client-tools</artifactId>
<version>1.6.0-cdh5.14.0</version>
</dependency> <!-- https://mvnrepository.com/artifact/org.apache.kudu/kudu-spark2 -->
<dependency>
<groupId>org.apache.kudu</groupId>
<artifactId>kudu-spark2_2.11</artifactId>
<version>1.6.0-cdh5.14.0</version>
</dependency> <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.0</version>
</dependency>
</dependencies>

具体详细代码看下一章介绍

上一篇:Unity Shader笔记


下一篇:mysql 5.7 使用主键约束