Interactive Query Tool: Impala

2023-11-14 22:02:16

What is Impala:

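Impala is an open-source tool from Cloudera for interactive, real-time queries over PB-scale data stored in HDFS and HBase (its strength is speed). Impala is modeled on Dremel, one of Google's "new three papers". The "old three papers" (BigTable, GFS, MapReduce) correspond to HBase, which we will study next, and to HDFS and MapReduce, which we have already covered.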

Impala's biggest selling point and defining feature is speed. The name comes from the impala, a species of antelope.

Advantages of Impala:

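Hive and MR, covered earlier, are suited to offline batch processing but fall short in interactive query scenarios, which require fast responses. To solve the query speed problem, Cloudera built Impala based on Google's Dremel. Impala abandons MapReduce in favor of techniques similar to a traditional MPP database, which greatly improves query speed.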

What is MPP?

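MPP (Massively Parallel Processing) means large-scale parallel processing. In an MPP cluster each node owns its resources exclusively, that is, it has its own disk and memory. The nodes are connected over the network, cooperate on the computation, and together provide the data service as a whole.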
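One rough way to picture the shared-nothing idea in plain Java is shown here. This is only a sketch: the class and method names are made up for illustration, and threads stand in for cluster nodes. Each "node" sums only its own partition, and a coordinator step combines the partial results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of shared-nothing parallelism; not Impala code.
class MppSketch {

    // Every inner list plays the role of the data owned by one node.
    static long parallelSum(List<List<Integer>> nodePartitions) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, nodePartitions.size()));
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<Integer> partition : nodePartitions) {
                // Local work: touches only this node's partition.
                Callable<Long> localWork = () -> {
                    long sum = 0;
                    for (int v : partition) sum += v;
                    return sum;
                };
                partials.add(pool.submit(localWork));
            }
            // Coordinator step: combine the partial results from all nodes.
            long total = 0;
            for (Future<Long> partial : partials) total += partial.get();
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("a node failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> nodes = Arrays.asList(
                Arrays.asList(1, 2, 3), Arrays.asList(4, 5), Arrays.asList(6));
        System.out.println(parallelSum(nodes)); // prints 21
    }
}
```

In a real MPP system the partitions live on different machines and the partial results travel over the network; the structure of "local work, then merge" is the part this sketch tries to show.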

Impala's technical advantages:

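Impala does not use MapReduce as its compute engine. MR is a very good distributed parallel computing framework, but it is oriented toward batch processing rather than interactive SQL execution.
Compared with Hive, Impala turns the whole query into a single execution plan tree instead of a chain of MR jobs. After the plan is distributed, Impala pulls the results of the previous stage and streams them up the execution tree, which removes the cost of writing intermediate results to disk and reading them back.
Impala runs as a service, which avoids paying a startup cost for every query; compared with Hive, there is no MR startup time.
It uses LLVM (a compiler framework written in C++) to generate native code at runtime, specialized for the specific query.
Good I/O scheduling: Impala supports direct data block reads and native code computation.
Choosing a suitable storage format gives the best performance (Impala supports multiple storage formats).
It uses memory as much as possible: intermediate results are not written to disk but are passed over the network as a stream.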
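The "plan tree plus pull" idea can be sketched with ordinary Java iterators. This is an illustration only, not Impala's implementation: the operator names (scan, filter, sum) are hypothetical, and rows are assumed to be non-null integers. Each operator asks its child for the next row only when its own parent asks, so no intermediate list is ever materialized.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical operators; a sketch of pull-based, streaming execution.
class PullPlanSketch {

    // Leaf operator: hands out rows one at a time.
    static Iterator<Integer> scan(List<Integer> rows) {
        return rows.iterator();
    }

    // Filter operator: pulls from its child only when its parent asks for a row.
    static Iterator<Integer> filter(Iterator<Integer> child, Predicate<Integer> keep) {
        return new Iterator<Integer>() {
            private Integer buffered; // next row that passed the predicate, if any

            @Override
            public boolean hasNext() {
                while (buffered == null && child.hasNext()) {
                    Integer row = child.next();
                    if (keep.test(row)) buffered = row;
                }
                return buffered != null;
            }

            @Override
            public Integer next() {
                if (!hasNext()) throw new NoSuchElementException();
                Integer row = buffered;
                buffered = null;
                return row;
            }
        };
    }

    // Root operator: pulls rows through the tree and adds them up as they arrive.
    static long sum(Iterator<Integer> child) {
        long total = 0;
        while (child.hasNext()) total += child.next();
        return total;
    }

    public static void main(String[] args) {
        // Plan tree: sum <- filter(even) <- scan
        Iterator<Integer> plan = filter(scan(Arrays.asList(1, 2, 3, 4, 5, 6)), r -> r % 2 == 0);
        System.out.println(sum(plan)); // prints 12
    }
}
```

A chain of MR jobs would instead write the output of each stage to disk before the next stage starts; here the filtered rows exist only one at a time, in memory.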

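Impala vs. Hive:

Query process

Hive: every query in Hive suffers from the common "cold start" problem (map and reduce tasks have to be started and stopped, and resources requested and released, for every query).
Impala: Impala avoids startup overhead wherever possible because it executes queries natively. Since queries must be served at any time, the Impala daemon processes are ready as soon as the cluster starts, and from then on they accept and execute query tasks.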

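Intermediate results

Hive: Hive materializes all intermediate results through the MR engine. They must be written to disk, which slows down data processing.
Impala: intermediate results are streamed between executors, which avoids writing them to disk. Memory is used wherever possible to avoid disk overhead.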

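Interactive queries

Hive: not an ideal choice for interactive computing.
Impala: very well suited to interactive computing (data at PB scale).

Compute engine

Hive: based on batch-oriented Hadoop MapReduce.
Impala: more like an MPP database.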

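Fault tolerance

Hive: fault tolerant (through MR and Yarn).
Impala: no fault tolerance. Because query performance is good, a query that hits an error is simply run again.

Query speed

Impala: 3 to 90 times faster than Hive.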

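Summary of Impala's advantages

1. Impala's biggest advantage is query speed, within a certain data volume;
2. Why it is fast: it avoids the drawbacks of the MR engine and adopts MPP database techniques.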

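Metadata updates:

Impala cannot automatically detect metadata changes made through Hive, so the metadata has to be refreshed manually.

To refresh all metadata, manually run invalidate metadata;
To refresh the metadata of a single table, run refresh dbname.tablename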
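If the refresh needs to be triggered from a program rather than from impala-shell, the two statements can be sent over a JDBC connection. The sketch below is hypothetical: the class and method names are invented for illustration, only the two SQL statements come from the notes above, and the database and table names are assumed to be trusted identifiers (they are concatenated into the SQL text).

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper; only the two SQL statements are taken from the text.
class MetadataRefreshSketch {

    // Statement that reloads the metadata of a single table.
    static String refreshSql(String db, String table) {
        return "refresh " + db + "." + table;
    }

    // Statement that marks the cached metadata of all tables as stale.
    static String invalidateAllSql() {
        return "invalidate metadata";
    }

    // Sends one of the statements above over an already-open connection.
    static void run(Connection connection, String sql) throws SQLException {
        try (Statement stmt = connection.createStatement()) {
            stmt.execute(sql);
        }
    }

    public static void main(String[] args) {
        // Only prints the statements; running them needs a live Impala connection.
        System.out.println(refreshSql("default", "t1")); // prints refresh default.t1
        System.out.println(invalidateAllSql());          // prints invalidate metadata
    }
}
```

Refreshing a single table is the cheaper of the two, so it is the natural choice when only one table was changed through Hive.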

Impala architecture diagram:

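For a join between large tables, Impala uses a hash join: rows whose id hashes to the same value are sent to the same node, so different nodes can run the join in parallel.

If one of the tables is small, Impala uses a broadcast join, in which the small table is sent to every node.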
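The two join strategies can be compared with a small in-memory simulation. This is a sketch under stated assumptions, not Impala's code: the class and method names are hypothetical, "nodes" are just lists, and every row is a two-element array of id and value. Both strategies should return the same joined rows, only the data movement differs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of partitioned and broadcast joins; rows are {id, value}.
class PartitionedJoinSketch {

    // Sends each row to the node chosen by hashing its id.
    static List<List<String[]>> partitionById(List<String[]> rows, int nodes) {
        List<List<String[]>> parts = new ArrayList<>();
        for (int i = 0; i < nodes; i++) parts.add(new ArrayList<>());
        for (String[] row : rows) {
            parts.get(Math.floorMod(row[0].hashCode(), nodes)).add(row);
        }
        return parts;
    }

    // Hash join on one node: build a table from the left side, probe with the right.
    static List<String> localJoin(List<String[]> left, List<String[]> right) {
        Map<String, List<String>> built = new HashMap<>();
        for (String[] l : left) built.computeIfAbsent(l[0], k -> new ArrayList<>()).add(l[1]);
        List<String> out = new ArrayList<>();
        for (String[] r : right) {
            for (String leftValue : built.getOrDefault(r[0], Collections.emptyList())) {
                out.add(r[0] + ":" + leftValue + ":" + r[1]);
            }
        }
        return out;
    }

    // Large-to-large: equal ids land on the same node, so each node joins independently.
    static List<String> partitionedJoin(List<String[]> left, List<String[]> right, int nodes) {
        List<List<String[]>> l = partitionById(left, nodes);
        List<List<String[]>> r = partitionById(right, nodes);
        List<String> out = new ArrayList<>();
        for (int n = 0; n < nodes; n++) out.addAll(localJoin(l.get(n), r.get(n)));
        return out;
    }

    // Small-to-large: the big table stays where it is, the small one is copied to every node.
    static List<String> broadcastJoin(List<String[]> small, List<String[]> big, int nodes) {
        List<String> out = new ArrayList<>();
        for (int n = 0; n < nodes; n++) {
            List<String[]> slice = new ArrayList<>(); // rows that already live on node n
            for (int i = n; i < big.size(); i += nodes) slice.add(big.get(i));
            out.addAll(localJoin(small, slice));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(new String[]{"1", "alice"}, new String[]{"2", "bob"});
        List<String[]> orders = Arrays.asList(
                new String[]{"1", "book"}, new String[]{"2", "pen"}, new String[]{"1", "cup"});
        List<String> joined = partitionedJoin(users, orders, 3);
        Collections.sort(joined); // node order is arbitrary, so sort before printing
        System.out.println(joined); // prints [1:alice:book, 1:alice:cup, 2:bob:pen]
    }
}
```

The partitioned variant moves both tables across the network once; the broadcast variant moves only the small table, but once per node, which is why it pays off only when that table is small.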

group by: Impala hash-distributes rows on the grouping columns so that different nodes can run a local group by in parallel, and finally merges the results from all nodes.
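The same three steps (distribute by hash, aggregate locally, merge) can be written down as a small simulation. Again this is only a sketch with hypothetical names: it counts rows per key, and lists stand in for nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of a distributed "select key, count(*) ... group by key".
class GroupBySketch {

    static Map<String, Integer> countByKey(List<String> keys, int nodes) {
        // Step 1: hash-distribute the rows so that equal keys reach the same node.
        List<List<String>> perNode = new ArrayList<>();
        for (int i = 0; i < nodes; i++) perNode.add(new ArrayList<>());
        for (String key : keys) {
            perNode.get(Math.floorMod(key.hashCode(), nodes)).add(key);
        }

        // Step 2: each node aggregates only its own rows (in a cluster this runs in parallel).
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (List<String> local : perNode) {
            Map<String, Integer> counts = new TreeMap<>();
            for (String key : local) counts.merge(key, 1, Integer::sum);
            partials.add(counts);
        }

        // Step 3: merge the per-node results into the final answer.
        Map<String, Integer> merged = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((key, count) -> merged.merge(key, count, Integer::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(countByKey(Arrays.asList("a", "b", "a", "c", "b", "a"), 3));
        // prints {a=3, b=2, c=1}
    }
}
```

Because equal keys always land on the same node, the merge in step 3 never has to combine two partial counts for the same key; it only concatenates disjoint results.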

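Connecting to Impala over JDBC:

Impala's SQL syntax is largely the same as Hive's, and most Hive built-in functions are supported.

Impala's command-line client is impala-shell.

For Impala configuration, see the Word document.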

<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.9.2</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-common -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-common</artifactId>
        <version>2.3.7</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-metastore -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-metastore</artifactId>
        <version>2.3.7</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-service -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-service</artifactId>
        <version>2.3.7</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-jdbc</artifactId>
        <version>2.3.7</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>2.3.7</version>
    </dependency>
</dependencies>

package com.lagou.impala.jdbc;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ImpalaTest {
    public static void main(String[] args) throws Exception {
        // Driver class and connection URL for Impala (reached through the Hive JDBC driver)
        String driver = "org.apache.hive.jdbc.HiveDriver";
        String driverUrl = "jdbc:hive2://linux122:21050/default;auth=noSasl";
        // SQL statement to run
        String querySql = "select * from t1";
        // Load the driver class
        Class.forName(driver);
        // try-with-resources closes the result set, statement and connection,
        // even if the query throws
        try (Connection connection = DriverManager.getConnection(driverUrl);
             PreparedStatement ps = connection.prepareStatement(querySql);
             ResultSet resultSet = ps.executeQuery()) {
            // Number of columns in each row
            final int columnCount = resultSet.getMetaData().getColumnCount();
            // Iterate over the result set and print each row, tab-separated
            while (resultSet.next()) {
                for (int i = 1; i <= columnCount; i++) {
                    System.out.print(resultSet.getString(i) + "\t");
                }
                System.out.println();
            }
        }
    }
}