Reading and Writing Hive with Spark SQL in IDEA
Adding dependencies
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.spark" %% "spark-sql" % "2.2.0",
  "com.databricks" %% "spark-csv" % "1.5.0",
  "org.apache.spark" %% "spark-hive" % "2.2.0",
  "mysql" % "mysql-connector-java" % "8.0.15")
Copying hive-site.xml
Copy hive-site.xml into the resources directory of the IDEA project. This file records the address of the Hive metastore. Without it, Spark SQL falls back to its default configuration and creates a local spark-warehouse folder inside the project to serve as its warehouse.
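As a minimal sketch, a hive-site.xml that points Spark SQL at a remote metastore service might contain just the following (the thrift host and port are placeholders for your own deployment, not values from this article):

```xml
<configuration>
  <!-- Address of the Hive metastore thrift service; placeholder host/port -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
```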
Code example
package doc
import java.io.File
import org.apache.spark.sql.{SaveMode, SparkSession}
/**
 * @Author huangwei
 * @Date 19-9-20
 * @Comments Reading and writing Hive with Spark SQL
 **/
object HiveTable {
def main(args: Array[String]): Unit = {
val warehouseLocation = new File("spark-warehouse").getAbsolutePath
val spark = SparkSession
.builder()
.appName("Spark Hive")
.master("local")
.config("spark.sql.warehouse.dir",warehouseLocation)
.enableHiveSupport()
.getOrCreate()
import spark.sql
sql("SHOW DATABASES").show()
// +------------+
// |databaseName|
// +------------+
// | default|
// | test|
// +------------+
// Select the database
sql("USE test")
// Run a query
sql("SELECT * FROM score").show()
// +----------+------------+-------+----+-------+
// |student_id|student_name|chinese|math|english|
// +----------+------------+-------+----+-------+
// | 1| ZhaoSi| 88| 98| 90|
// | 2| ZhangSan| 80| 89| 78|
// | 3| WangMaZi| 65| 70| 79|
// | 4| LiuNeng| 78| 80| 82|
// | 5| LiSi| 85| 90| 72|
// +----------+------------+-------+----+-------+
// Create a table; `from` is a reserved word in HiveQL, so escape it with backticks
sql("CREATE TABLE IF NOT EXISTS persons (name STRING, gender STRING, age INT, `from` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
// Load data
sql("LOAD DATA LOCAL INPATH '/home/huangwei/person.txt' INTO TABLE persons")
sql("SELECT * FROM persons").show()
// +------+------+---+---------+
// |  name|gender|age|     from|
// +------+------+---+---------+
// |  马云|    男| 55|浙江-杭州|
// |马化腾|    男| 48|广东-深圳|
// |李彦宏|    男| 51|山西-阳泉|
// |刘强东|    男| 46|江苏-宿迁|
// |  雷军|    男| 50|湖北-仙桃|
// +------+------+---+---------+
val result = sql("SELECT * FROM persons WHERE age < 50")
// Export the query result
result.repartition(1).write.format("csv").mode(SaveMode.Append).save("/home/huangwei/result")
}
}
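The example above reads from Hive and exports to CSV; the write direction can be sketched as well. The following is a minimal sketch of writing a DataFrame back into a Hive-managed table with saveAsTable, assuming the same test.score table as above; the target table name test.score_total is a placeholder, not from the original article:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object WriteHiveTable {
  def main(args: Array[String]): Unit = {
    // Hive support must be enabled, exactly as in the read example
    val spark = SparkSession.builder()
      .appName("Spark Hive Write")
      .master("local")
      .enableHiveSupport()
      .getOrCreate()

    // Read an existing Hive table and derive a new column
    val scores = spark.sql(
      "SELECT *, chinese + math + english AS total FROM test.score")

    // saveAsTable creates (or overwrites) a Hive-managed table;
    // "test.score_total" is a hypothetical table name
    scores.write.mode(SaveMode.Overwrite).saveAsTable("test.score_total")

    spark.stop()
  }
}
```

This requires a reachable Hive metastore (the copied hive-site.xml), so it cannot run standalone.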
Pitfalls encountered
Exception: File /hive/tmp does not exist
The /hive/tmp directory (referenced in hive-site.xml) cannot be found, even though it exists on HDFS. Because the paths in hive-site.xml carry no scheme, the program launched from IDEA resolves them against the local file system, so the directory appears to be missing. The fix is to qualify the paths with the HDFS URI, e.g. change /hive/tmp to hdfs://localhost:9000/hive/tmp (substitute your own NameNode address).
Solution: update hive-site.xml
<property>
<name>hive.exec.scratchdir</name>
<value>hdfs://localhost:9000/hive/tmp</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://localhost:9000/hive/warehouse</value>
</property>
<property>
<name>hive.querylog.location</name>
<value>hdfs://localhost:9000/hive/log</value>
</property>
Exception: Hive Schema version 2.3.0 does not match metastore's schema version 1.2.0 Metastore is not upgraded or corrupt
Solution: disable schema version verification in hive-site.xml (alternatively, upgrade the metastore schema with Hive's schematool, e.g. schematool -dbType mysql -upgradeSchema):
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
Exception: java.lang.RuntimeException: java.lang.RuntimeException: Unable to create log directory /usr/local/hive2.3.6/tmp/huangwei;
The local scratch directory sits under /usr/local, where the current user has no write permission.
Solution: point hive.exec.local.scratchdir at a writable directory in hive-site.xml
<property>
<name>hive.exec.local.scratchdir</name>
<value>/home/huangwei/tmp</value>
</property>
There is also a warning: Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
Because the MySQL driver used here is the MySQL 8 version, hive-site.xml needs to be updated to change com.mysql.jdbc.Driver to com.mysql.cj.jdbc.Driver:
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
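Not shown in the original config, but with the MySQL 8 driver the metastore connection URL often needs explicit useSSL and serverTimezone parameters as well, or the metastore may fail to connect. A sketch, where the host, port, and database name are placeholders for your own setup:

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- Host, port, and "hive" database name are placeholders -->
  <value>jdbc:mysql://localhost:3306/hive?useSSL=false&amp;serverTimezone=UTC</value>
  <description>JDBC connection URL for the metastore database</description>
</property>
```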