Android -- Watchdog Analysis

2022-10-31 19:56:26

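In microcomputer systems built around a microcontroller, interference from external electromagnetic fields can cause the program to run away and get stuck in an endless loop. Normal execution is interrupted, the system controlled by the microcontroller can no longer work, and the whole system stalls with unpredictable consequences. To monitor the microcontroller's running state in real time, a dedicated chip was introduced for exactly this purpose, commonly known as a "watchdog".

Android likewise needs to keep watch over several important services: when one of them gets into trouble, the watchdog detects it and kills the SystemServer process. It is therefore worth understanding how this mechanism works.

So which services are monitored?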

ActivityManagerService.java :frameworks\base\services\java\com\android\server\am

PowerManagerService.java :frameworks\base\services\java\com\android\server

WindowManagerService.java :frameworks\base\services\java\com\android\server

Let's walk through the whole flow step by step:

1. Initialization

run @ SystemServer.java

    Slog.i(TAG, "Init Watchdog");
    Watchdog.getInstance().init(context, battery, power, alarm,
            ActivityManagerService.self());

The instance is created using the singleton pattern:

    public static Watchdog getInstance() {
        if (sWatchdog == null) {
            sWatchdog = new Watchdog();
        }
        return sWatchdog;
    }
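
Note that the lazy check in getInstance() is not synchronized, so it is only safe if the first call happens before other threads can race on it. As a minimal sketch of a variant that tolerates concurrent first calls, the initialization-on-demand holder idiom can be used. The class name GuardSketch is hypothetical and is not taken from the Android sources.

```java
// Hypothetical stand-in for a singleton service class; not Android code.
final class GuardSketch {
    private GuardSketch() { }

    // The JVM initializes Holder (and INSTANCE) exactly once, on first access,
    // so no explicit locking is needed in getInstance().
    private static final class Holder {
        static final GuardSketch INSTANCE = new GuardSketch();
    }

    static GuardSketch getInstance() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        // Both calls should yield the same object.
        System.out.println(getInstance() == getInstance());
    }
}
```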

    public void init(Context context, BatteryService battery,
            PowerManagerService power, AlarmManagerService alarm,
            ActivityManagerService activity) {
        // Context-related state and service references
        mResolver = context.getContentResolver();
        mBattery = battery;
        mPower = power;
        mAlarm = alarm;
        mActivity = activity;

        // Register RebootReceiver to receive the reboot broadcast
        context.registerReceiver(new RebootReceiver(),
                new IntentFilter(REBOOT_ACTION));
        ...
        // System boot time
        mBootTime = System.currentTimeMillis();
    }

OK, once init() has been called, initialization is complete.

2. Running

run @ SystemServer.java

Watchdog.getInstance().start() is called to start the watchdog.

First, look at the definition of the Watchdog class:

    /** This class calls its monitor every minute. Killing this process if they don't return **/
    public class Watchdog extends Thread {
    }

Because it extends Thread, it runs on its own thread: calling start() on it causes run() in Watchdog.java to be executed on that thread.

    public void run() {
        boolean waitedHalf = false;
        while (true) {
            mCompleted = false;

            // 1. Send a MONITOR message to mHandler to request a check that the
            //    services are working normally
            mHandler.sendEmptyMessage(MONITOR);

            synchronized (this) {
                // 2. Wait for the timeout period before deciding how to proceed
                long timeout = TIME_TO_WAIT;

                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                long start = SystemClock.uptimeMillis();
                while (timeout > 0 && !mForceKillSystem) {
                    try {
                        wait(timeout); // notifyAll() is called when mForceKillSystem is set
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    timeout = TIME_TO_WAIT - (SystemClock.uptimeMillis() - start);
                }

                // 3. If mCompleted is true, all services are fine (explained later)
                if (mCompleted && !mForceKillSystem) {
                    // The monitors have returned.
                    waitedHalf = false;
                    continue;
                }

                // 4. The monitors did not return in time, so a deadlock is suspected.
                //    Use dumpStackTraces to dump the stack information.
                if (!waitedHalf) {
                    // We've waited half the deadlock-detection interval. Pull a stack
                    // trace and wait another half.
                    ArrayList<Integer> pids = new ArrayList<Integer>();
                    pids.add(Process.myPid());
                    ActivityManagerService.dumpStackTraces(true, pids, null, null);
                    waitedHalf = true;
                    continue; // but run the check one more time first
                }
            }

            SystemClock.sleep(2000);

            // 5. Dump the kernel stack traces
            // Pull our own kernel thread stacks as well if we're configured for that
            if (RECORD_KERNEL_THREADS) {
                dumpKernelStackTraces();
            }

            // 6. Something is wrong: a service appears to be deadlocked, so kill
            //    the SystemServer process.
            // Only kill the process if the debugger is not attached.
            // (name is assigned in code omitted from this excerpt)
            if (!Debug.isDebuggerConnected()) {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + name);
                Process.killProcess(Process.myPid());
                System.exit(10);
            } else {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            }

            waitedHalf = false;
        }
    }
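
The inner wait loop recomputes the remaining timeout after every wake-up, because wait() can return before the full timeout has elapsed. The following plain-Java sketch illustrates that pattern. It assumes System.nanoTime() as the clock in place of SystemClock.uptimeMillis(), and the class and method names are illustrative only. Unlike the original loop, which waits out the whole interval and only then looks at mCompleted, this sketch returns as soon as the condition is set.

```java
final class TimedWaitSketch {
    private boolean done = false;

    // Called by the worker when its task has finished.
    synchronized void markDone() {
        done = true;
        notifyAll();
    }

    // Waits until markDone() is called or timeoutMs elapses.
    // Returns true if the task finished in time.
    synchronized boolean awaitDone(long timeoutMs) {
        final long start = System.nanoTime();
        long remaining = timeoutMs;
        while (!done && remaining > 0) {
            try {
                wait(remaining); // may return early; the loop re-checks the condition
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return done;
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
            remaining = timeoutMs - elapsedMs; // time still left on the deadline
        }
        return done;
    }

    public static void main(String[] args) {
        TimedWaitSketch finished = new TimedWaitSketch();
        new Thread(finished::markDone).start();
        System.out.println("finished in time: " + finished.awaitDone(1000));

        // Nobody calls markDone() here, so this wait runs into the timeout.
        System.out.println("finished in time: " + new TimedWaitSketch().awaitDone(50));
    }
}
```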

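Main logic: at regular intervals the watchdog thread sends a MONITOR message to another thread, which checks whether each service is running normally. The watchdog thread keeps waiting for and checking the result; if the check fails, it kills SystemServer.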
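
The two-thread arrangement described above can be sketched in plain Java. This is only an illustration under stated assumptions: a single-threaded ExecutorService stands in for the handler thread, Future.get() with a timeout stands in for the wait loop, and WatchdogSketch, Monitor, and checkOnce are hypothetical names rather than the Android implementation.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class WatchdogSketch {
    interface Monitor { void monitor(); }

    private final List<Monitor> monitors = new CopyOnWriteArrayList<>();

    // Single worker thread standing in for the thread that runs the checks.
    private final ExecutorService checker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "checker");
        t.setDaemon(true); // do not keep the JVM alive if a monitor hangs
        return t;
    });

    void addMonitor(Monitor m) { monitors.add(m); }

    // Runs every monitor on the checker thread. Returns false if they do not
    // all return within timeoutMs, which is the condition a watchdog acts on.
    boolean checkOnce(long timeoutMs) {
        Future<?> pass = checker.submit(() -> {
            for (Monitor m : monitors) {
                m.monitor();
            }
        });
        try {
            pass.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // a monitor is still blocked
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false; // a monitor threw an exception
        }
    }

    public static void main(String[] args) {
        WatchdogSketch w = new WatchdogSketch();
        w.addMonitor(() -> { });
        System.out.println("healthy: " + w.checkOnce(500));

        // Simulate a hung service with a monitor that sleeps past the timeout.
        w.addMonitor(() -> {
            try { Thread.sleep(5_000); } catch (InterruptedException ignored) { }
        });
        System.out.println("healthy: " + w.checkOnce(200));
    }
}
```

Where the real watchdog kills the process on a failed check, the sketch only reports the result so that it stays safe to run.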

3. Service-checking thread

    /**
     * Used for scheduling monitor callbacks and checking memory usage.
     */
    final class HeartbeatHandler extends Handler {
        @Override
        public void handleMessage(Message msg) { // Looper message handler
            switch (msg.what) {
                case MONITOR: {
                    // Check each service in turn by calling its monitor() function
                    final int size = mMonitors.size();
                    for (int i = 0 ; i < size ; i++) {
                        mCurrentMonitor = mMonitors.get(i);
                        mCurrentMonitor.monitor();
                    }

                    // All checks returned, so set mCompleted to true
                    synchronized (Watchdog.this) {
                        mCompleted = true;
                        mCurrentMonitor = null;
                    }
                } break;
                ...
            }
        }
    }

Now let's look at how each service confirms that it is running properly, taking ActivityManagerService as an example.

First, it adds itself to the list of monitors:

    private ActivityManagerService() {
        // Add ourself to the Watchdog monitors.
        Watchdog.getInstance().addMonitor(this);
    }

Then it implements the monitor() function:

    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
    public void monitor() {
        synchronized (this) { }
    }

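That makes it clear: the check simply tests whether the service has deadlocked, and if it has, the only option is to kill SystemServer. Deadlocks can arise for many reasons, but one case deserves attention: a Java-level deadlock can originate in a call into a native function. If that native function interacts with hardware and takes too long to return, the lock is held for the whole time, and the check fails.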
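
To see why an empty synchronized block works as a liveness probe, consider the following minimal sketch. It assumes a hypothetical Service class guarded by its own intrinsic lock, and a helper named responds() that runs the probe on a separate thread with a timeout. None of these names come from the Android sources.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class LockProbeSketch {
    // Hypothetical stand-in for a service guarded by its own intrinsic lock.
    static final class Service {
        void monitor() {
            synchronized (this) { } // returns only once the lock can be acquired
        }
    }

    // Returns true if service.monitor() returns within timeoutMs.
    static boolean responds(Service service, long timeoutMs) {
        CountDownLatch returned = new CountDownLatch(1);
        Thread probe = new Thread(() -> {
            service.monitor();
            returned.countDown();
        }, "probe");
        probe.setDaemon(true);
        probe.start();
        try {
            return returned.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Service service = new Service();
        System.out.println("lock free: " + responds(service, 500));

        // Simulate a thread that is stuck while holding the service lock.
        synchronized (service) {
            System.out.println("lock held: " + responds(service, 200));
        }
    }
}
```

While the lock is held, the probe thread blocks inside monitor() and the timeout expires, which is the same signal the watchdog relies on.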

4. Memory usage check

Sending the message:

    final class GlobalPssCollected implements Runnable {
        public void run() {
            mHandler.sendEmptyMessage(GLOBAL_PSS);
        }
    }



The handler that processes the memory check:

    final class HeartbeatHandler extends Handler {
        @Override
        public void handleMessage(Message msg) {
            switch (msg.what) {
                case GLOBAL_PSS: {
                    if (mHaveGlobalPss) {
                        // During the last pass we collected pss information, so
                        // now it is time to report it.
                        mHaveGlobalPss = false;
                        if (localLOGV) Slog.v(TAG, "Received global pss, logging.");
                        logGlobalMemory();
                    }
                } break;
                ...
            }
        }
    }

Its main job is to collect PSS statistics and to read memory information from the Linux kernel:

    void logGlobalMemory() {
        mActivity.collectPss(stats);

        Process.readProcLines("/proc/meminfo",
                mMemInfoFields, mMemInfoSizes);

        Process.readProcLines("/proc/vmstat",
                mVMStatFields, mVMStatSizes);
    }
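
To illustrate what reading selected fields from /proc/meminfo involves, here is a minimal plain-Java sketch. It does not use Process.readProcLines. Instead it parses text in the "Name:   value kB" layout from a string, and the class name MemInfoSketch, the parse() helper, and the sample values are all made up for illustration. Lines without a colon after the field name, such as those in /proc/vmstat, are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MemInfoSketch {
    // Extracts the numeric value of each requested field from text laid out
    // like /proc/meminfo ("Name:   12345 kB"). Fields that do not appear are
    // left out of the result. Assumes well-formed input.
    static Map<String, Long> parse(String text, String... fields) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            for (String field : fields) {
                if (line.startsWith(field + ":")) {
                    // Drop "Name:" and keep the first whitespace-separated token.
                    String rest = line.substring(field.length() + 1).trim();
                    out.put(field, Long.parseLong(rest.split("\\s+")[0]));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample text with made-up values, in the same layout as /proc/meminfo.
        String sample = "MemTotal:  1024000 kB\nMemFree:  204800 kB\nCached:  102400 kB\n";
        System.out.println(parse(sample, "MemTotal", "MemFree"));
    }
}
```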
