Launching spark-shell with dynamic partitioning, Snappy compression, Parquet storage, and table backup

1、Launching spark-shell with dynamic partitioning enabled

spark-shell \
  --executor-memory 16G \
  --total-executor-cores 10 \
  --executor-cores 10 \
  --conf "spark.hadoop.hive.exec.dynamic.partition=true" \
  --conf "spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict" \
  --conf spark.sql.shuffle.partitions=10 \
  --conf spark.default.parallelism=10
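With `hive.exec.dynamic.partition.mode=nonstrict`, every partition column of an INSERT may be resolved dynamically from the query result rather than stated statically. A minimal sketch of what this enables (the table and column names here are assumptions for illustration, not from the original post):

```scala
// Assumes a Hive-enabled SparkSession (`spark`) from the shell launched above.
// Both partition columns (dt, country) are left dynamic; this requires
// nonstrict mode, since strict mode demands at least one static partition.
spark.sql("""
  INSERT OVERWRITE TABLE logs_partitioned PARTITION (dt, country)
  SELECT msg, dt, country FROM logs_staging
""")
```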

2、Compressing and backing up tables with Spark SQL

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path, FileStatus}
import scala.collection.mutable.{ArrayBuffer, ListBuffer}
import scala.io.Source
import java.io.PrintWriter

val dbn = "src_es"
val tbn = Array("middata", "decision_info")

for (tb <- tbn) {
    println(dbn + "." + tb)
    // Write the table out as Snappy-compressed Parquet to the backup path
    val df = sqlContext.sql("select * from " + dbn + "." + tb)
    df.write.option("compression", "snappy").format("parquet")
      .save("/backupdatafile/" + dbn + ".db/" + tb)
    // Read the backup back and overwrite the original table from it
    val dbtb = spark.read.parquet("/backupdatafile/" + dbn + ".db/" + tb)
    dbtb.createOrReplaceTempView("test_" + tb)
    spark.sql("insert overwrite table " + dbn + "." + tb + " select * from test_" + tb)
}
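The `FileSystem`/`Path` imports above go unused in the loop; one plausible use is to confirm that each backup actually landed on HDFS before running the destructive `insert overwrite`. A hedged sketch of such a check (the path layout mirrors the loop above; error handling is minimal and this is an assumption about intent, not the original author's code):

```scala
// Assumes the same spark-shell session; `sc` supplies the Hadoop configuration.
val fs = FileSystem.get(sc.hadoopConfiguration)
for (tb <- tbn) {
  val backup = new Path("/backupdatafile/" + dbn + ".db/" + tb)
  // Only trust the backup if its directory exists and is readable as Parquet
  if (fs.exists(backup)) {
    val cnt = spark.read.parquet(backup.toString).count()
    println("backup of " + dbn + "." + tb + " ok, rows: " + cnt)
  } else {
    println("backup missing for " + dbn + "." + tb + ", skip the overwrite")
  }
}
```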