Spark2.2(三十九):如何根据appName监控spark任务,当任务不存在则启动(任务存在当超过多久没有活动状态则kill,等待下次启动)

业务需求

实现一个根据spark任务的appName来监控任务是否存在,及任务是否卡死的监控。

1)给定一个appName,根据appName从yarn application -list中验证任务是否存在,不存在则调用spark-submit.sh脚本来启动任务;

2)如果任务存在yarn application -list中,则读取‘监控文件(监控文件内容包括:appId,最新活动时间)’,从监控文件中读取出最后活动的日期,计算当前日期与app的最后活动日期相差时间为X,如果X大于30minutes(认为任务处于假死状态[再发布环境发现有的任务DAG抛出OOM,导致app的executor和driver依然存在,当时不执行任务调度,程序卡死。具体错误详情请参考《https://issues.apache.org/jira/browse/SPARK-26452》]),则执行yarn application -kill appId(杀掉任务),等待下次监控脚本执行时重启任务。

监控实现

脚本

#/bin/sh
#LANG=zh_CN.utf8
#export LANG
export SPARK_KAFKA_VERSION=0.10
export LANG=zh_CN.UTF-
# export env variable
if [ -f ~/.bash_profile ];
then
source ~/.bash_profile
fi
source /etc/profile myAppName='myBatchTopic' #这里指定需要监控的spark任务的appName,注意:这名字重复了会导致监控失败。
apps='' for app in `yarn application -list`
do
apps=${app},$apps
done
apps=${apps%?} if [[ $apps =~ $myAppName ]];
then
echo "appName($myAppName) exists in yarn application list"
#)运行 hadop fs -cat /目录/appName,读取其中最后更新日期;(如果文件不存在,则跳过等待文件生成。)
monitorInfo=$(hadoop fs -cat /user/dx/streaming/monitor/${myAppName})
LD_IFS="$IFS"
IFS=","
array=($monitorInfo)
IFS="$OLD_IFS"
appId=${array[]}
monitorLastDate=${array[]}
echo "loading mintor information 'appId:$appId,monitorLastUpdateDate:$monitorLastDate'" current_date=$(date "+%Y-%m-%d %H:%M:%S")
echo "loading current date '$current_date'" #)与当前日期对比:
# 如果距离当前日期相差小于30min,则不做处理;
# 如果大于30min则kill job,根据上边yarn application -list中能获取对应的appId,运行yarn application -kill appId
t1=`date -d "$current_date" +%s`
t2=`date -d "$monitorLastDate" +%s`
diff_minute=$(($(($t1-$t2))/))
echo "current date($current_date) over than monitorLastDate($monitorLastDate) $diff_minute minutes"
if [ $diff_minute -gt ];
then
echo 'over then 30 minutes'
$(yarn application -kill ${appId})
echo "kill application ${appId}"
else
echo 'less than 30 minutes'
fi
else
echo "appName($myAppName) not exists in yarn application list"
#./submit_x1_x2.sh abc TestRestartDriver #这里指定需要启动的脚本来启动相关任务
$(nohup ./submit_checkpoint2.sh >> ./output.log >& &)
fi

监控脚本业务流程图:

Spark2.2(三十九):如何根据appName监控spark任务,当任务不存在则启动(任务存在当超过多久没有活动状态则kill,等待下次启动)

监控文件生成

我这里程序是spark structured streaming,因此可以注册sparkSesssion的streams()的query的监听事件

sparkSession.streams().addListener(new GlobalStreamingQueryListener(sparkSession。。。))

在监听事件中实现如下:

public class GlobalStreamingQueryListener extends StreamingQueryListener {
private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(GlobalStreamingQueryListener.class);
private static final String monitorBaseDir = "/user/dx/streaming/monitor/";
private SparkSession sparkSession = null;
private LongAccumulator triggerAccumulator = null; public GlobalStreamingQueryListener(SparkSession sparkSession, LongAccumulator triggerAccumulator) {
this.sparkSession = sparkSession;
this.triggerAccumulator = triggerAccumulator;
} @Override
public void onQueryStarted(QueryStartedEvent queryStarted) {
System.out.println("Query started: " + queryStarted.id());
} @Override
public void onQueryTerminated(QueryTerminatedEvent queryTerminated) {
System.out.println("Query terminated: " + queryTerminated.id());
} @Override
public void onQueryProgress(QueryProgressEvent queryProgress) {
System.out.println("Query made progress: " + queryProgress.progress());
// sparkSession.sql("select * from " +
// queryProgress.progress().name()).show(); triggerAccumulator.add(1);
System.out.println("Trigger accumulator value: " + triggerAccumulator.value()); logger.info("minitor start .... ");
try {
if (HDFSUtility.createDir(monitorBaseDir)) {
logger.info("Create monitor base dir(" + monitorBaseDir + ") success");
} else {
logger.info("Create monitor base dir(" + monitorBaseDir + ") fail");
}
} catch (IOException e) {
logger.error("An error was thrown while create monitor base dir(" + monitorBaseDir + ")");
e.printStackTrace();
} // spark.app.id application_1543820999543_0193
String appId = this.sparkSession.conf().get("spark.app.id");
// spark.app.name myBatchTopic
String appName = this.sparkSession.conf().get("spark.app.name");
String mintorFilePath = (monitorBaseDir.endsWith(File.separator) ? monitorBaseDir : monitorBaseDir + File.separator) + appName;
logger.info("The application's id is " + appId);
logger.info("The application's name is " + appName);
logger.warn("If the appName is not unique,it will result in a monitor error"); try {
HDFSUtility.overwriter(mintorFilePath, appId + "," + new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date()));
} catch (IOException e) {
logger.error("An error was thrown while write info to monitor file(" + mintorFilePath + ")");
e.printStackTrace();
}
logger.info("minitor stop .... ");
} }
HDFSUtility.java中方法如下:
public class HDFSUtility {
private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(HDFSUtility.class); /**
* 当目录不存在时,创建目录。
*
* @param dirPath
* 目标目录
* @return true-創建成功;false-失敗。
* @throws IOException
* */
public static boolean createDir(String dirPath) throws IOException {
FileSystem fs = null;
Path dir = new Path(dirPath);
boolean success = false;
try {
fs = FileSystem.get(new Configuration()); if (!fs.exists(dir)) {
success = fs.mkdirs(dir);
} else {
success = true;
}
} catch (IOException e) {
logger.error("create dir (" + dirPath + ") fail:", e);
throw e;
} finally {
try {
fs.close();
} catch (IOException e) {
e.printStackTrace();
}
} return success;
} /**
* 覆盖文件写入信息
*
* @param filePath
* 目标文件路径
* @param content
* 被写入内容
* @throws IOException
* */
public static void overwriter(String filePath, String content) throws IOException {
FileSystem fs = null;
// 在指定路径创建FSDataOutputStream。默认情况下会覆盖文件。
FSDataOutputStream outputStream = null;
Path file = new Path(filePath); try {
fs = FileSystem.get(new Configuration());
if (fs.exists(file)) {
System.out.println("File exists(" + filePath + ")");
}
outputStream = fs.create(file);
outputStream.write(content.getBytes());
} catch (IOException e) {
logger.error("write into file(" + filePath + ") fail:", e);
throw e;
} finally {
if (outputStream != null) {
try {
outputStream.close();
} catch (IOException e) {
e.printStackTrace();
}
}
try {
fs.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
 
上一篇:137.https是否能够发送http的请求


下一篇:137.菜单悬停放大特效