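Ambari Study Guide

2024-04-05 17:14:54

Purpose:

Hadoop is a large-scale, distributed data storage and processing infrastructure that runs on clusters of commodity hosts. Monitoring and managing such a complex distributed system is not simple. To manage this complexity, Apache Ambari collects a wide range of information from the cluster's nodes and services and presents it in an easy-to-use, centralized interface: Ambari Web.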

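Features:

Displays information such as service-specific summaries, graphs, and alerts.

Creates and manages HDP clusters and performs basic operational tasks, such as starting and stopping services, adding hosts to the cluster, and updating service configurations.

Performs cluster administration tasks, such as enabling Kerberos security and performing Stack upgrades.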

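Usage:

1. Dashboard
Use the cluster dashboard to monitor the Hadoop cluster. Open it by clicking Dashboard at the top of the Ambari Web UI main window. The Ambari Web UI displays the dashboard page as its home page. Use the dashboard to view the operating status of the cluster. The left side of Ambari Web lists the Hadoop services currently running on the cluster. The dashboard includes the Metrics, Heatmaps, and Config History tabs; the Metrics tab is displayed by default.

1.1 Metrics

On the Metrics page, multiple widgets present operating status information for the services in the HDP cluster. Most widgets display a single metric; for example, HDFS Disk Usage is shown as a load chart and a percentage indicator.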

HDFS:

       NameNode Heap       : Percentage of NameNode Java Virtual Machine (JVM) heap memory used

       HDFS Disk Usage     : Percentage of distributed file system (DFS) capacity used, combining DFS and non-DFS usage

       NameNode CPU WIO    : Percentage of CPU wait I/O

       Data Nodes Live     : Number of DataNodes operating, as reported by the NameNode

       NameNode RPC        : The average RPC queue latency

       NameNode Uptime     : The NameNode uptime calculation

YARN:

       ResourceManager Heap     : Percentage of ResourceManager JVM heap memory used

       NodeManagers Live        : Number of NodeManagers operating, as reported by the ResourceManager

       ResourceManager Uptime   : The ResourceManager uptime

       YARN Memory              : Percentage of available YARN memory (used versus total available)

HBase:

       HBase Master Heap      : Percentage of HBase Master JVM heap memory used

       HBase Ave Load         : The average load on the HBase servers

       Region in Transition   : Number of HBase regions in transition

       HBase Master Uptime    : The HBase Master uptime

Storm:

       Supervisors Live   : Number of supervisors operating, as reported by the Nimbus server

Cluster-Wide:

       Memory usage    : Cluster-wide memory usage, including cached, swapped, used, and shared memory

       Network usage   : Cluster-wide network utilization, including in-and-out

       CPU Usage       : Cluster-wide CPU information, including system, user, and wait IO

       Cluster Load    : Cluster-wide load information, including the total number of nodes, total number of CPUs, number of running processes, and 1-min load

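1.2 Heatmaps: Visualizing Metrics

As described earlier, the Ambari Web home page has a status summary panel on the left and the Metrics, Heatmaps, and Config History tabs at the top, with the Metrics tab displayed by default.

To see a graphical representation of overall cluster utilization, click the Heatmaps tab. It presents this information using a simple color coding known as a heatmap.

Each host in the cluster is represented by a colored block. Hover the mouse over a host's block to see more information about that host; a separate window displays metrics about the HDP components installed on the host.

The color displayed in a block represents usage in a unit appropriate for the selected set of metrics. If any data necessary to determine usage is unavailable, the block displays Invalid data. You can solve this issue by changing the default maximum value for the heatmap, using the Select Metric menu.

1.3 Config History: Configuration History

2. Services
2.1 Operating Status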

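The service summary list on the left side of Ambari Web lists all the Apache component services that are currently monitored. The shape, color, and action of the icon to the left of each item indicate the operating status of that item:

Solid green | All masters are running

Blinking green | Starting up

Solid red | At least one master is down

Blinking red | Stopping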

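2.2 Linking to Service UIs

The HDFS Links and HBase Links widgets list HDP components for which links to more metrics information are available, such as thread stacks, logs, and native component UIs. For example, for HDFS you can link to the NameNode, Secondary NameNode, and DataNode components.

Click the More drop-down list to select from the list of links available for each service. The Ambari Dashboard includes additional links to metrics for the following services: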

HDFS：

   NameNode UI       ：Links to the NameNode UI

   NameNode Logs   ：Links to the NameNode logs

   NameNode JMX    ：Links to the NameNode JMX servlet

   Thread Stacks    ：Links to the NameNode thread stack traces

HBase：

   HBase Master UI    ：Links to the HBase Master UI

   HBase Logs            ：Links to the HBase logs

   ZooKeeper Info        ：Links to ZooKeeper information

   HBase Master JMX    ：Links to the HBase Master JMX servlet

   Debug Dump            ：Links to debug information

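   Thread Stacks        ：Links to the HBase Master thread stack traces

3. Hosts
As a cluster system administrator or cluster operator, you need to know the operating status of each host and which hosts have issues that require action. You can use the Ambari Web Hosts page to manage multiple Hortonworks Data Platform (HDP) components, such as DataNodes, NameNodes, NodeManagers, and RegionServers, running on hosts throughout the cluster. For example, you can restart all DataNode components, optionally controlling that task with rolling restarts. Ambari Hosts lets you filter the host components you want to manage, based on operating status, host health, and defined host groupings.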

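3.1 Understanding Host Status

You can view the status of individual hosts in the cluster on the Ambari Web Hosts page. Hosts are listed by fully qualified domain name (FQDN), each accompanied by a colored icon that indicates the host's operating status.

● Red triangle       : At least one master component on that host is down. Hover over the icon to see a tooltip that lists the affected components.

● Orange             : At least one slave component on that host is down. Hover over the icon to see a tooltip that lists the affected components.

● Yellow             : Ambari Server has not received a heartbeat from that host for more than 3 minutes.

● Green              : Normal running state.

● Maintenance Mode   : A black "medical bag" icon indicates that a host is in maintenance mode.

● Alert              : A red square with a number indicates the number of alerts on that host.

A red icon overrides an orange icon, which overrides a yellow icon. In other words, a host that has a master component down is shown with a red icon, even though it might also have slave component or connection issues. For hosts in maintenance mode or experiencing alerts, the icon appears to the right of the host name.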
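The precedence rule above (red over orange over yellow) can be sketched as a small function. This is only an illustration of the ordering described in the text, not Ambari's implementation; the function name and boolean parameters are hypothetical.

```python
def host_status_icon(master_down: bool, slave_down: bool, heartbeat_lost: bool) -> str:
    """Return the icon color for a host, checking conditions in precedence order.

    Hypothetical helper: the three flags stand for "a master component is down",
    "a slave component is down", and "no heartbeat for more than 3 minutes".
    """
    if master_down:        # red overrides everything else
        return "red"
    if slave_down:         # orange overrides yellow
        return "orange"
    if heartbeat_lost:     # yellow is shown only when no component is down
        return "yellow"
    return "green"         # normal running state
```

For instance, a host with both a master component down and a lost heartbeat would be reported as red, because the first matching condition wins.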

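3.2 Searching the Hosts Page

You can search the full list of hosts, filtering by host name, component attribute, and component operating status. You can also search by keyword, simply by typing a word in the search box.

The host search tool appears above the host list.

① Click the search box.

The available search types appear, including:

Search by Host Attribute   : Search by host name, IP, host status, and other attributes

Search by Service          : Find hosts that run components of a given service

Search by Component        : Find hosts that run a component in a given state, such as started, stopped, maintenance mode, and so on

Search by keyword          : Type any word that describes what you are looking for in the search box; this becomes a text filter

② Click a search type.

A list of available options appears, depending on your selection in step ①.

For example, if you click Service, the current services appear.

③ Click an option.

The list of hosts that match the current search criteria is displayed on the Hosts page.

④ Click further options to refine the search.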

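3.3 Performing Host-Level Actions

Use the Actions UI control to act on hosts in the cluster. An action consists of one or more operations, possibly on multiple hosts; such actions are also known as bulk operations.

The Actions control comprises a workflow of three menus, used in sequence, that refine your search: a hosts menu, a menu of objects based on your host choice, and a menu of actions based on your object choice.

For example, to restart the RegionServer components on every host in the cluster that runs a RegionServer:

① On the Hosts page, select or search for the hosts running a RegionServer.

② Using the Actions control, click Filtered Hosts > RegionServers > Restart.

③ Click OK to start the selected operation.

④ Optionally, monitor background operations to diagnose or handle failures in the restart operation.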

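3.4 Managing Components on a Host

To manage the components running on a specific host, click one of the FQDNs listed on the Hosts page. The page for that host appears. Click the Summary tab to display the Components panel, which lists all components installed on that host.

To manage all of the components on a host, use the Host Actions control at the top right of the display to start, stop, restart, delete, or turn on maintenance mode for all components installed on the selected host.

Alternatively, you can manage components individually, using the drop-down menu shown next to each component in the Components panel. Each component's menu is labeled with the component's current operating status. Opening the menu displays the management options available for that status. For example, you can decommission, restart, or stop the DataNode component for HDFS.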

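3.5 Decommissioning a Master or Slave

Decommissioning is a process that supports removing components and their hosts from the cluster. You must decommission a master or slave running on a host before removing it or its host from service. Decommissioning helps prevent data loss or service disruption. Decommissioning is available for the following component types: DataNodes, NodeManagers, and RegionServers.

Decommissioning performs the following tasks:

For DataNodes       : Safely replicates the HDFS data to other DataNodes in the cluster

For NodeManagers    : Stops accepting new job requests and stops the component

For RegionServers   : Turns on drain mode and stops the component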

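3.6 Decommissioning and Deleting Components

3.6.1 Decommission a Component

① Using Ambari Web, browse to the Hosts page.

② Find and click the FQDN of the host on which the component resides.

③ Using the Actions control, click Selected Hosts > DataNodes > Decommission.

The UI shows the Decommissioning status while the process runs.

When the process completes, the status changes to Decommissioned.

3.6.2 Delete a Component

① Using Ambari Web, browse to the Hosts page.

② Find and click the FQDN of the host on which the component resides.

③ In Components, find a decommissioned component.

④ If the component status is Started, stop it.

A decommissioned slave component may restart in the decommissioned state.

⑤ From the component drop-down menu, click Delete.

Deleting a slave component, such as a DataNode, does not automatically inform a master component, such as a NameNode, to remove that slave from its exclusion list. Adding a deleted slave component back into the cluster therefore presents the following issue: from the master's perspective, the added slave remains decommissioned. Restarting the master component works around the issue.

⑥ To let Ambari recognize and monitor only the remaining components, restart services.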

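3.7 Deleting a Host from a Cluster

Deleting a host removes that host from the cluster.

Prerequisites: before deleting a host, you must complete the following:

● Stop all components running on the host.

● Decommission any DataNodes running on the host.

● Move any master components, such as the NameNode or ResourceManager, off the host.

● Turn off Maintenance Mode for the host.

Steps:

① Using Ambari Web, browse to the Hosts page, then find and click the FQDN of the host you want to delete.

② On the Host-Details page, click Host Actions.

③ Click Delete.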

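3.8 Setting Maintenance Mode

On an Ambari-managed cluster, when you want to focus on performing hardware or software maintenance, changing configuration settings, troubleshooting, decommissioning, or removing cluster nodes, setting Maintenance Mode lets you suppress alerts and omit bulk operations for specific services, components, and hosts.

Explicitly setting Maintenance Mode for a service implicitly sets Maintenance Mode for the components and hosts that run the service. Although Maintenance Mode prevents bulk operations from acting on a service, component, or host, you can still explicitly start and stop that service, component, or host while it is in Maintenance Mode.

The following sections provide an example of using Maintenance Mode in a three-node, Ambari-managed cluster. They describe how to explicitly turn on Maintenance Mode for the HDFS service and for a host, and how Maintenance Mode is implicitly turned on for services, components, and hosts.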

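3.8.1 Set Maintenance Mode for a Service

① On the Services page, select HDFS.

② Select Service Actions, then choose Turn On Maintenance Mode.

③ Click OK to confirm.

Notice, in Services Summary, that Maintenance Mode turns on for the NameNode and SNameNode components.

3.8.2 Set Maintenance Mode for a Host

To set Maintenance Mode for a host by using the Host Actions control:

Steps:

① On the Hosts page, select the host FQDN.

② Select Host Actions, then choose Turn On Maintenance Mode.

③ Click OK to confirm.

Notice that Maintenance Mode turns on for all components on the host.

To set Maintenance Mode for hosts by using the Actions control:

Steps:

① On the Hosts page, select the host FQDN.

② In Actions > Selected Hosts > Hosts, choose Turn On Maintenance Mode.

③ Click OK to confirm.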

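3.8.3 When to Set Maintenance Mode

Four common scenarios for setting Maintenance Mode are performing maintenance, testing a configuration change, completely removing a service, and addressing alerts.

■ You want to perform hardware or operating system maintenance on a host

While performing maintenance, you want to be able to do the following:

● Prevent alerts generated by all components on this host

● Stop, start, and restart each component on the host

● Prevent host-level or service-level bulk operations from starting, stopping, or restarting components on this host

To achieve these goals, explicitly set Maintenance Mode for the host. This implicitly sets Maintenance Mode for all components on the host.

■ You want to test a service configuration change. You should stop, start, and restart the service to test whether restarting activates the change

To test configuration changes, you want to ensure the following conditions:

● No components of this service generate alerts

● No host-level or service-level bulk operations start, stop, or restart components of this service

To achieve these goals, explicitly set Maintenance Mode for the service. Putting a service in Maintenance Mode implicitly turns on Maintenance Mode for all components of the service.

■ You want to stop a service

To stop a service completely, you want to ensure the following conditions:

● The service generates no warnings

● No components start, stop, or restart because of host-level actions or bulk operations

To achieve these goals, explicitly set Maintenance Mode for the service. Putting a service in Maintenance Mode implicitly turns on Maintenance Mode for all components of the service.

■ You want to stop a host component from generating alerts

To stop a host component from generating alerts, you must be able to do the following:

● Check the component

● Assess the warnings and alerts generated for the component

To achieve these goals, explicitly set Maintenance Mode for the host component. Putting a host component in Maintenance Mode prevents host-level and service-level bulk operations from starting or restarting the component. You can still restart the component explicitly while Maintenance Mode is on.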

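3.9 Add Hosts to a Cluster

① Browse to the Hosts page and select Actions > +Add New Hosts.

The Add Host wizard provides a sequence of prompts similar to those in the Ambari Cluster Install wizard.

② Follow the prompts, providing the requested information, to continue and complete the wizard.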

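3.10 Establishing Rack Awareness

There are two ways to establish rack awareness: either set the rack ID by using Ambari, or set the rack ID by using a custom topology script.

3.10.1 Set the Rack ID Using Ambari

Setting the Rack ID lets Ambari manage rack information for hosts, including displaying hosts by Rack ID in heatmaps and letting users filter and find hosts by Rack ID on the Hosts page.

If HDFS is installed in the cluster, Ambari passes the Rack ID information to HDFS by using a topology script. Ambari generates the topology script at /etc/hadoop/conf/topology.py and automatically sets the net.topology.script.file.name property in core-site. This script reads a mappings file, /etc/hadoop/conf/topology_mappings.data, which Ambari generates automatically. When you change Rack ID assignments in Ambari, this mappings file is updated when the HDFS configuration is pushed out. HDFS uses this topology script to obtain rack information about the DataNode hosts. There are two ways to set the Rack ID by using Ambari Web: for multiple hosts, use Actions; for an individual host, use Host Actions.

■ Set the Rack ID for multiple hosts

Steps:

① Using Actions, click selected, filtered, or all hosts.

② Click Hosts.

③ Click Set Rack.

■ Set the Rack ID on a single host

Steps:

① Browse to the host page.

② Click Host Actions.

③ Click Set Rack.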

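3.10.2 Set the Rack ID Using a Custom Topology Script

If you do not want Ambari to manage the host-to-rack information, you can use a custom topology script. To do this, you must create your own topology script and manage distributing it to all hosts. Note also that, because Ambari then has no access to the host rack information, heatmaps in Ambari Web cannot display racks.

To set the Rack ID by using a custom script:

Steps:

① Browse to Services > HDFS > Configs.

② Change net.topology.script.file.name to your own custom topology script.

For example, /etc/hadoop/conf/topology.sh

③ Distribute the topology script to all hosts.

You can now manage the rack mapping information through your script, outside of Ambari.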
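As a rough illustration of what such a custom script might look like, here is a minimal sketch in Python. Hadoop invokes the configured script with one or more host names or IP addresses as arguments and reads one rack path per argument from standard output. The host-to-rack table, the rack names, and the function name below are made up for the example; a real script would load site-specific data, and since any executable works, the file would not have to be the topology.sh named above.

```python
#!/usr/bin/env python3
import sys

# Hypothetical host-to-rack table; replace with your own inventory source.
RACK_BY_HOST = {
    "10.0.0.11": "/rack-a",
    "10.0.0.12": "/rack-b",
}

# Rack reported for any host that is missing from the table.
DEFAULT_RACK = "/default-rack"


def resolve_racks(hosts):
    """Return one rack path per requested host, preserving argument order."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes the hosts as command-line arguments and expects
    # whitespace-separated rack paths, one per argument, on stdout.
    print(" ".join(resolve_racks(sys.argv[1:])))
```

Falling back to a default rack for unknown hosts keeps the number of output entries equal to the number of arguments, which the caller relies on.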
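4. Managing Services
Use the Services tab of the Ambari Web UI home page to monitor and manage selected services running on the cluster.

All services installed on the cluster are listed in the panel on the left.

4.1 Starting and Stopping All Services

To start or stop all listed services at once, click Actions and then click Start All or Stop All.

4.2 Displaying Service Operating Summary

Click the name of a service in the service list. The Summary tab appears, containing basic information about the operating status of that service, including any alerts. To refresh the monitoring panels and display information about a different service, click a different service name in the list.

Notice the colored icons next to each service name, which indicate the operating status of the service and the alerts generated for it. You can click a View Host link to view components and the hosts on which the selected component runs.

4.2.1 Alerts and Health Checks

On the Summary tab, click Alerts to see a list of all health checks and their status for the selected service; critical alerts are shown first. To see an alert definition, click the text title of the alert message in the list. For example, click HBase > Services > Alerts > HBase Master Process.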

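4.2.2 Modifying the Service Dashboard

Depending on the selected service, the Summary tab includes a Metrics dashboard that is populated by default with important service metrics to monitor.

If you have the Ambari Metrics service installed and use Apache HDFS, Apache Hive, Apache HBase, or Apache YARN, you can customize the Metrics dashboard. You can add widgets to and remove widgets from the Metrics dashboard, and you can create new widgets or delete widgets. Widgets can be private to you and your dashboard, or they can be shared in a Widget Browser library.

You must have the Ambari Metrics service installed to view, create, and customize the Metrics dashboard.

4.2.2.1 Adding or Removing a Widget

To add or remove a widget in the Metrics dashboard of the HDFS, Hive, HBase, or YARN service:

① Either click the + icon to launch the Widget Browser, or click Widget Browser from Actions > Metrics.

② The Widget Browser displays the widgets available to add to the service dashboard, including widgets already included in the dashboard, shared widgets, and widgets you have created.

③ To display only the widgets you have created, select the "Show only my widgets" checkbox.

④ To remove a widget that has been added to the dashboard, click its remove icon.

⑤ To add an available widget that has not yet been added, click Add.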

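4.2.2.2 Creating a Widget

① Click the + icon to launch the Widget Browser.

② Either click the Create Widget button, or click Create Widget in the Actions menu.

③ Select the type of widget to create.

④ Depending on the service and the widget type, you can select metrics and use operators to create an expression to be displayed in the widget. A preview of the widget is displayed as you build the expression.

⑤ Enter the widget name and description.

⑥ Optionally, choose to share the widget.

Sharing a widget makes it available to all users in the cluster. After a widget is shared, other Ambari Admins or Cluster Operators can modify or delete it, and this cannot be undone.

4.2.2.3 Deleting a Widget

① Click the + icon to launch the Widget Browser, or click Widget Browser from Actions > Metrics.

② The Widget Browser displays the widgets available to add to the service dashboard, including shared widgets and widgets you have created.

③ If a widget has already been added to the dashboard, it is shown as Added; click it to remove it.

④ For widgets that you created, you can select the More... option to delete them.

⑤ For shared widgets, if you are an Ambari Admin or Cluster Operator, you also have the option to delete them.

Deleting a shared widget removes it from all users. This process cannot be undone.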

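4.2.2.4 Export Widget Graph Data

You can export the metrics data from widget graphs by using the Export capability.

① Hover the mouse pointer over the widget graph, or click the graph to zoom in, to display the Export icon.

② Click the icon and specify either CSV or JSON format.

4.2.2.5 Setting Display Timezone

You can set the timezone used for displaying metrics data in widget graphs.

① In Ambari Web, click your user name and select Settings.

② In the Locale section, select the Timezone.

③ Click Save.

The Ambari Web UI reloads and displays graphs using the timezone you set.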

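4.3 Adding a Service

The Ambari installation wizard installs all available Hadoop services by default. You can choose to deploy only some services at initial installation and then add other services as you need them. For example, some users install only the core Hadoop services initially. The Add Service option of the Actions control lets you deploy additional services without interrupting operations in your Hadoop cluster. When you have deployed all available services, the Add Service control is displayed as disabled, indicating that it is unavailable.

To add a service, follow the steps below, which show the example of adding the Apache Falcon service to a Hadoop cluster:

(1) Click Actions > Add Service.

The Add Service wizard opens.

(2) Click Choose Services.

The Choose Services panel appears. Services that are already active are shown with a green background and their checkboxes selected.

(3) In the Choose Services panel, select the checkbox next to the service you want to add, and then click Next.

(4) On the Assign Masters page, confirm the default host assignment.

The Add Service wizard indicates the hosts on which the master components for the chosen service will be installed. Alternatively, use the drop-down menu to choose a different host on which to add the master components for the chosen service.

(5) If the service you are adding requires slaves and clients, on the Assign Slaves and Clients page, accept the default host assignments for the slave and client components and click Next. Alternatively, select the hosts on which you want to install the slave and client components, and then click Next.

(6) In Customize Services, accept the default configuration properties.

Alternatively, edit the default values for the configuration properties if necessary. Choose Override to create a configuration group for this service. Then, choose Next.

(7) On the Review page, verify that the configuration settings match your intentions, and then click Deploy.

(8) Monitor the progress of installing, starting, and testing the service. When it completes successfully, click Next.

(9) When you see the summary of installation results, click Complete.

(10) Review and confirm the recommended configuration changes.

(11) Restart any other components whose configurations have become stale as a result of adding the service.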

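4.4 Performing Service Actions

Manage a selected service on your cluster by performing service actions. On the Services tab, click Service Actions and then click an option. The available options depend on the selected service. For example, among the HDFS service actions, clicking Turn On Maintenance Mode suppresses alerts and status-change indicators generated by the service, while still allowing you to start, stop, restart, move, or perform maintenance tasks on the service.

4.5 Rolling Restarts

When you restart multiple services, components, or hosts, use rolling restarts to distribute the task. A rolling restart stops and starts multiple running slave components, such as DataNodes, NodeManagers, RegionServers, or Supervisors, using a batch sequence.

Important:

Rolling restarts of DataNodes should be performed only during cluster maintenance.

You can set rolling restart parameter values to control the number of components restarted at a time, the time between restarts, the tolerance for failures, and the limits on the number of components restarted in a large cluster.

To run a rolling restart, perform the following steps:

① In the service list on the left side of the Services page, click a service name.

② On the service's Summary page, click a link, such as DataNodes or RegionServers, for any component you want to restart. The Hosts page lists the names of the cluster hosts on which the selected component exists.

③ Using the host-level Actions menu, click the name of a slave component, and then click Restart.

④ Review and set values for the Rolling Restart Parameters.

⑤ Optionally, reset the flag to restart only components with changed configurations.

⑥ Click Trigger Restart.

After triggering the restart, you should monitor the progress of the background operations.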

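4.5.1 Setting Rolling Restart Parameters

When you choose to restart slave components, you can use parameters to control how the restart rolls across the components. Default values are based on ten percent of the total number of components in your cluster. For example, for components in a three-node cluster, the default settings for a rolling restart are to restart one component at a time, wait two minutes between restarts, proceed if only one failure occurs, and restart all existing components that run this service. Enter integer, non-zero values for all parameters.

Batch Size                  : Number of components to include in each restart batch

Wait Time                   : Time (in seconds) to wait between queuing each batch of components

Tolerate up to x failures   : Total number of restart failures to tolerate, across all batches, before halting the restarts and not queuing further batches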
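To make the interplay of these parameters concrete, the following sketch simulates a rolling restart over a list of component names. It is not Ambari's scheduler: the function names are hypothetical, the restart itself is a caller-supplied callable, the Wait Time pause is only marked by a comment, and "tolerate up to x failures" is interpreted as halting once the failure count exceeds x.

```python
def plan_batches(components, batch_size):
    """Split the component list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [components[i:i + batch_size] for i in range(0, len(components), batch_size)]


def run_rolling_restart(components, batch_size, tolerate_failures, restart):
    """Simulate a rolling restart.

    restart is a callable taking a component name and returning True on success.
    Returns (restarted, failed, halted), where halted is True when the failure
    budget was exceeded and the remaining batches were not queued.
    """
    restarted, failed = [], []
    for batch in plan_batches(components, batch_size):
        for component in batch:
            if restart(component):
                restarted.append(component)
            else:
                failed.append(component)
        # Failures are counted across all batches, as the parameter describes.
        if len(failed) > tolerate_failures:
            return restarted, failed, True
        # A real orchestrator would wait "Wait Time" seconds here before the next batch.
    return restarted, failed, False
```

With four DataNodes, a batch size of 2, and a tolerance of 1, a single failed restart in the first batch would still let the second batch run; with a tolerance of 0 the second batch would never be queued.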

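4.5.2 Aborting a Rolling Restart

To abort future rolling restart batches, click Abort Rolling Restart.

4.6 Monitoring Background Operations

You can use the Background Operations window to monitor the progress and completion of a task that comprises multiple operations, such as restarting components. The Background Operations window opens by default when you run such a task. For example, to monitor the progress of a rolling restart, click elements in the Background Operations window:

① Click the right arrow for each operation to show the restart progress on each host.

② After the restart operations complete, click the right arrow or a host name to view log files and any error messages generated on the selected host.

③ Optionally, use the Copy, Open, or Host Logs icons at the top right of the Background Operations window to copy, open, or view the logs for the operation.

You can also select the checkbox at the bottom of the Background Operations window to hide the window when performing tasks in the future.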

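4.7 Removing a Service

Important:

Removing a service is not reversible, and all configuration history will be lost.

Steps:

① In the left panel of the Services tab, click the name of the service.

② Click Service Actions > Delete.

③ When prompted, remove any dependent services.

④ When prompted, stop all components of the service.

⑤ Confirm the removal.

After the service stops, you must confirm the removal.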

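4.8 Operations Audit

When you perform an operation by using Ambari, such as logging in or out, stopping or starting a service, or adding or removing a service, Ambari creates an entry in an audit log. By reading the audit log, you can determine who performed an operation, when the operation occurred, and other operation-specific information. You can find the Ambari audit log on the Ambari server host at:

/var/log/ambari-server/ambari-audit.log

When you change the configuration of a service, Ambari creates an entry in the audit log and also creates a dedicated log file:

ambari-config-changes.log

By reading the configuration change log, you can find out more about each change, for example:

2016-05-25 18:31:26,242 INFO - Cluster 'MyCluster' changed by: 'admin'; service_name='HDFS' config_group='default' config_group_id='-1' version='2'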
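Entries in this format are easy to pull apart with a regular expression. The sketch below is based only on the single sample entry shown above, treated as one line; other entries in a real ambari-config-changes.log may look different, and the function and field names are chosen for illustration.

```python
import re

# Leading part of the entry: timestamp, level, cluster name, and user.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>\w+)\s+-\s+"
    r"Cluster '(?P<cluster>[^']*)' changed by: '(?P<user>[^']*)';\s*"
    r"(?P<rest>.*)$"
)

# Trailing key='value' pairs such as service_name='HDFS'.
KV_RE = re.compile(r"(\w+)='([^']*)'")


def parse_config_change(line):
    """Parse one config-change entry into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = {key: match.group(key) for key in ("timestamp", "level", "cluster", "user")}
    record.update(dict(KV_RE.findall(match.group("rest"))))
    return record
```

All values are kept as strings, so a caller that needs the version as a number would convert it explicitly.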

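4.9 Using Quick Links

Select the Quick Links option to access additional sources of information about a selected service. For example, the Quick Links options for HDFS include the following:

NameNode JMX

NameNode Logs

Thread Stacks

NameNode UI

Quick Links are not available for every service.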

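4.10 Refreshing YARN Capacity Scheduler

After you modify the Capacity Scheduler configuration, YARN can refresh the queues without a ResourceManager restart, provided that the changes are not destructive. If you make a destructive change, such as deleting a queue, the refresh operation fails with the following message: Failed to re-init queues. After a destructive change, you must restart the ResourceManager for the Capacity Scheduler change to take effect.

To refresh the Capacity Scheduler, perform the following steps:

① In Ambari Web, browse to Services > YARN > Summary.

② Click Service Actions, and then click Refresh YARN Capacity Scheduler.

③ Confirm that you want to perform this operation.

The refresh operation is submitted to the YARN ResourceManager.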

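4.11 Managing HDFS

4.11.1 Rebalancing HDFS

HDFS provides a "balancer" utility to help balance the distribution of blocks across the DataNodes in the cluster. To start the balancing process, perform the following steps:

① In Ambari Web, browse to Services > HDFS > Summary.

② Click Service Actions, and then click Rebalance HDFS.

③ Enter the Balance Threshold value as a percentage of disk capacity.

④ Click Start.

You can monitor or cancel the rebalance process by opening the Background Operations window.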

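4.11.2 Tuning Garbage Collection

The Concurrent Mark Sweep (CMS) garbage collection (GC) process includes a set of heuristic rules used to trigger garbage collection. This makes garbage collection less predictable and tends to delay collection until capacity is reached, producing a Full GC error (which might pause all processes).

Ambari sets default values for many properties during cluster deployment. Within the export HADOOP_NAMENODE_OPTS= clause of the hadoop-env template, two parameters affect the CMS GC process, with the following default settings:

● -XX:+UseCMSInitiatingOccupancyOnly

Prevents the use of GC heuristics.

● -XX:CMSInitiatingOccupancyFraction=<percent>

Tells the Java VM when the CMS collector should be triggered.

If this value is set too low, the CMS collector runs too often; if it is set too high, the CMS collector is triggered too late, and a concurrent mode failure might occur. The default setting for -XX:CMSInitiatingOccupancyFraction is 70, which means that the application should use less than 70% of capacity.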

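To tune garbage collection by modifying the NameNode CMS GC parameters, perform the following steps:

① In Ambari Web, browse to Services > HDFS.

② Open the Configs tab and browse to Advanced > Advanced hadoop-env.

③ Edit the hadoop-env template.

④ Save the configuration and, when prompted, restart.

4.11.3 Customizing the HDFS Home Directory

By default, the HDFS home directory for a user is /user/<user_name>. You can use the dfs.user.home.base.dir property to customize the HDFS home directory.

① In Ambari Web, browse to Services > HDFS > Configs > Advanced.

② Click Custom hdfs-site, and then click Add Property.

③ In the Add Property pop-up, add the following property:

dfs.user.home.base.dir=<home_directory>

④ Click Add. Then, when prompted, save the new configuration and restart.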

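4.12 Managing Atlas in a Storm Environment

When you update the Apache Atlas configuration settings in Ambari, Ambari marks the services that require a restart. To restart these services, perform the following steps:

① In Ambari Web, click the Actions control.

② Click Restart All Required.

Note:

Apache Oozie requires a restart after an Atlas configuration update, but might not be marked as requiring a restart in Ambari. If Oozie is not included, perform the following steps to restart Oozie:

① In Ambari Web, click Oozie in the services summary panel.

② Click Service Actions > Restart All.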

4.13 Enabling the Oozie UI

Ext JS is GPL-licensed software and is no longer included in HDP 2.6. As a result, the Oozie WAR file is not built to include the Ext JS-based user interface unless Ext JS is manually installed on the Oozie server. If you add Oozie to HDP 2.6.4 or later by using Ambari 2.6.1.3, no Oozie UI is available by default. If you want an Oozie UI, you must manually install Ext JS on the Oozie server host. During the restart operation, Ambari then rebuilds the Oozie WAR file to include the Ext JS-based Oozie UI.

Steps:

① Log in to the Oozie Server host.

② Download and install the Ext JS package.

CentOS, RHEL, Oracle Linux 7:

wget http://public-repo-1.hortonworks.com/HDP-UTILS-GPL-1.1.0.22/repos/centos7/extjs/extjs-2.2-1.noarch.rpm

rpm -ivh extjs-2.2-1.noarch.rpm

③ Remove the following file:

rm /usr/hdp/current/oozie-server/.prepare_war_cmd

④ Restart the Oozie Server from the Ambari UI.

Ambari rebuilds the Oozie WAR file.

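5. Managing Service High Availability

Ambari Web provides a wizard-driven user experience for configuring high availability of the components of several Hortonworks Data Platform (HDP) stack services. High availability is assured by establishing primary and secondary components. If the primary component fails or becomes unavailable, the secondary component takes over. After configuring high availability for a service, you can use Ambari to manage or disable (roll back) high availability of components in that service.

5.1 NameNode High Availability

To ensure that another NameNode is always available in the cluster if the primary NameNode host fails, you can use Ambari Web to enable and configure NameNode high availability on the cluster.

5.1.1 Configuring NameNode High Availability

Prerequisites:

● Verify that the cluster has at least three hosts and is running at least three Apache ZooKeeper servers.

● Verify that the Hadoop Distributed File System (HDFS) and ZooKeeper are not in Maintenance Mode.

HDFS and ZooKeeper must be stopped and then started when you enable NameNode HA. Maintenance Mode prevents those start and stop operations. If HDFS or ZooKeeper is in Maintenance Mode, the NameNode HA wizard cannot complete successfully.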

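Steps:

(1) In Ambari Web, select Services > HDFS > Summary.

(2) Click Service Actions, and then click Enable NameNode HA.

(3) The Enable HA wizard launches. This wizard describes the set of automated and manual steps you must perform to set up NameNode high availability.

(4) On the Get Started page, enter the Nameservice ID, and then click Next.

After HA is set up, you use this Nameservice ID instead of the NameNode FQDN.

(5) On the Select Hosts page, select a host for the additional NameNode and hosts for the JournalNodes, and then click Next.

(6) On the Review page, confirm your host selections, and then click Next.

(7) Follow the directions on the Manual Steps Required: Create Checkpoint on NameNode page, and then click Next.

You must log in to the current NameNode host and run the commands that put the NameNode in safe mode and create a checkpoint.

(8) When Ambari detects success and the message at the bottom of the window changes to Checkpoint created, click Next.

(9) On the Configure Components page, monitor the configuration progress bars, and then click Next.

(10) Follow the directions on the Manual Steps Required: Initialize JournalNodes page, and then click Next.

You must log in to the current NameNode host and run the command to initialize the JournalNodes.

(11) When Ambari detects success and the message at the bottom of the window changes to JournalNodes initialized, click Next.

(12) On the Start Components page, monitor the progress bars as the ZooKeeper servers and the NameNode start, and then click Next.

On a Ranger-enabled cluster where Hive is configured to use MySQL, Ranger fails to start if MySQL is stopped. To resolve this issue, start the Hive MySQL database and then retry starting the components.

(13) On the Manual Steps Required: Initialize NameNode HA Metadata page, complete each step by following the directions on the page, and then click Next.

For this step, you must log in to both the current NameNode host and the additional NameNode host. Make sure you are logged in to the correct host for each command. After completing each command, click OK to confirm.

(14) On the Finalize HA Setup page, monitor the progress bars as the wizard completes the HA setup, and then click Done to finish the wizard.

After the Ambari Web UI reloads, you might see some alert notifications. Wait a few minutes until all the services have restarted.

(15) If necessary, restart any components by using Ambari Web.

(16) If you use Hive, you must manually change the Hive Metastore FS root to point to the Nameservice URI instead of the NameNode URI. You created the Nameservice ID in the Get Started step.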

Steps:

a. Find the current FS root on the Hive host:

hive --config /etc/hive/conf/conf.server --service metatool -listFSRoot

The output is similar to the following:

Listing FS Roots... hdfs://<namenodehost>/apps/hive/warehouse.

b. Change the FS root:

$ hive --config /etc/hive/conf/conf.server --service metatool -updateLocation <new-location> <old-location>

For example, if the Nameservice ID is mycluster, enter:

$ hive --config /etc/hive/conf/conf.server --service metatool -updateLocation hdfs://mycluster/apps/hive/warehouse   \

hdfs://c6401.ambari.apache.org/apps/hive/warehouse

The output is similar to the following:

Successfully updated the following locations...Updated X records in SDS table

(17) Adjust the ZooKeeper Failover Controller retries setting for your environment:

a. Browse to Services > HDFS > Configs > Advanced core-site.

b. Set ha.failover-controller.active-standby-elector.zk.op.retries=120.

Next steps:

Review and confirm all recommended configuration changes.

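5.1.2 Rolling Back NameNode HA

To disable (roll back) NameNode high availability, perform the following tasks (depending on your installation):

(1)   Stop HBase

(2)   Checkpoint the active NameNode

(3)   Stop all services

(4)   Prepare the Ambari Server host for rollback

(5)   Restore the HBase configuration

(6)   Delete the ZooKeeper Failover Controllers

(7)   Modify the HDFS configurations

(8)   Re-create the Secondary NameNode

(9)   Re-enable the Secondary NameNode

(10)  Delete all JournalNodes

(11)  Delete the additional NameNode

(12)  Verify the HDFS components

(13)  Start HDFS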

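5.1.2.1 Stop HBase

① In the Ambari Web cluster dashboard, click the HBase service.

② Click Service Actions > Stop.

③ Wait until HBase has stopped completely before continuing.

5.1.2.2 Checkpoint the Active NameNode

If HDFS has been used since you enabled NameNode HA, but you want to revert to a non-HA state, you must checkpoint the HDFS state before proceeding with the rollback.

If the Enable NameNode HA wizard failed and you need to revert, you can omit this step and proceed to stop all services.

Checkpointing the HDFS state requires different syntax, depending on whether Kerberos security is enabled on the cluster.

   ● If Kerberos security is not enabled on the cluster, use the following commands on the active NameNode host to save the namespace:

   sudo su -l <HDFS_USER> -c 'hdfs dfsadmin -safemode enter'

   sudo su -l <HDFS_USER> -c 'hdfs dfsadmin -saveNamespace'

   ● If Kerberos security is enabled on the cluster, use the following commands to save the namespace:

   sudo su -l <HDFS_USER> -c 'kinit -kt /etc/security/keytabs/nn.service.keytab nn/<HOSTNAME>@<REALM>;hdfs dfsadmin -safemode enter'

   sudo su -l <HDFS_USER> -c 'kinit -kt /etc/security/keytabs/nn.service.keytab nn/<HOSTNAME>@<REALM>;hdfs dfsadmin -saveNamespace'

   In this example, <HDFS_USER> is the HDFS service user (for example, hdfs), <HOSTNAME> is the active NameNode host name, and <REALM> is your Kerberos realm.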

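5.1.2.3 Stop All Services

After stopping HBase and, if necessary, checkpointing the active NameNode, stop all services.

   ① In Ambari Web, click the Services tab.

   ② Click Stop All.

   ③ Wait for all services to stop completely before continuing.

5.1.2.4 Prepare the Ambari Server Host for Rollback

To prepare for the rollback procedure:

   ① Log in to the Ambari server host.

   ② Set the following environment variables:

       export AMBARI_USER=AMBARI_USERNAME   : Substitute the Ambari Web administrator user name; the default is admin

       export AMBARI_PW=AMBARI_PASSWORD   : Substitute the Ambari Web administrator password; the default is admin

       export AMBARI_PORT=AMBARI_PORT   : Substitute the Ambari Web port; the default is 8080

       export AMBARI_PROTO=AMBARI_PROTOCOL   : Substitute the protocol used to connect to Ambari Web, either http or https; the default is http

       export CLUSTER_NAME=CLUSTER_NAME   : Substitute the name of your cluster, for example mycluster

       export NAMENODE_HOSTNAME=NN_HOSTNAME   : Substitute the FQDN of the host for the non-HA NameNode, for example namenode.mycompany.com

       export ADDITIONAL_NAMENODE_HOSTNAME=ANN_HOSTNAME   : Substitute the FQDN of the host for the additional NameNode used in the HA setup

       export SECONDARY_NAMENODE_HOSTNAME=SNN_HOSTNAME   : Substitute the FQDN of the host for the secondary NameNode in the non-HA setup

       export JOURNALNODE1_HOSTNAME=JOUR1_HOSTNAME   : Substitute the FQDN of the host for the first JournalNode

       export JOURNALNODE2_HOSTNAME=JOUR2_HOSTNAME   : Substitute the FQDN of the host for the second JournalNode

       export JOURNALNODE3_HOSTNAME=JOUR3_HOSTNAME   : Substitute the FQDN of the host for the third JournalNode

   ③ Double-check that these environment variables are set correctly.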
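Since a mistyped variable can send a DELETE request to the wrong host later in the rollback, it can help to check the variables programmatically before going on. The sketch below reads the variable names listed above from the environment (or from a dict passed in), reports any that are unset or empty, and assembles the cluster API base URL used by the curl commands in the following sections. The function names are illustrative, and the defaults for protocol and port are the ones stated above.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARIABLES = [
    "AMBARI_USER", "AMBARI_PW", "AMBARI_PORT", "AMBARI_PROTO", "CLUSTER_NAME",
    "NAMENODE_HOSTNAME", "ADDITIONAL_NAMENODE_HOSTNAME", "SECONDARY_NAMENODE_HOSTNAME",
    "JOURNALNODE1_HOSTNAME", "JOURNALNODE2_HOSTNAME", "JOURNALNODE3_HOSTNAME",
]


def missing_variables(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


def ambari_api_base(env=None):
    """Build <proto>://localhost:<port>/api/v1/clusters/<cluster> from the environment."""
    env = os.environ if env is None else env
    proto = env.get("AMBARI_PROTO", "http")   # documented default
    port = env.get("AMBARI_PORT", "8080")     # documented default
    cluster = env["CLUSTER_NAME"]             # no sensible default; fail loudly
    return f"{proto}://localhost:{port}/api/v1/clusters/{cluster}"
```

Calling missing_variables() with no argument inspects the current process environment, so it only sees variables that were exported in the same shell session.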

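5.1.2.5 Restore the HBase Configuration

If you have installed HBase, you might need to restore its configuration to the pre-HA state.

   Note:

       For Ambari 2.6.0 and later, configs.sh is no longer supported and will fail. Use configs.py instead.

   ①   On the Ambari server host, determine whether the current HBase configuration must be restored:

       /var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> get localhost    \

       <CLUSTER_NAME>   hbase-site

       Substitute the environment variables you set in Prepare the Ambari Server Host for Rollback for the variable names in the command.

       If hbase.rootdir is set to the NameService ID you created in the Enable NameNode HA wizard, you must revert hbase-site to its non-HA value. For example, in

       "hbase.rootdir":"hdfs://<name-service-id>:8020/apps/hbase/data" the hbase.rootdir property points to the NameService ID, so the value must be rolled back.

       If hbase.rootdir points to a specific NameNode host, it does not need to be rolled back. For example, in "hbase.rootdir":"hdfs://<nn01.mycompany.com>:8020/apps/hbase/data"

       the hbase.rootdir property points to a specific NameNode host rather than to a NameService ID. This needs no rollback, and you can proceed to delete the ZooKeeper Failover Controllers.

   ②   If you must roll back the hbase.rootdir value, use the configs.py script on the Ambari server host to make the necessary change:

       /var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> set    \

       localhost <CLUSTER_NAME> hbase-site hbase.rootdir hdfs://<NAMENODE_HOSTNAME>:8020/apps/hbase/data

       Substitute the environment variables you set in Prepare the Ambari Server Host for Rollback for the variable names in the command.

   ③   On the Ambari server host, verify that the hbase.rootdir property has been restored correctly:

       /var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> get localhost \

       <CLUSTER_NAME> hbase-site

       The hbase.rootdir property should now point to the NameNode host name rather than to the NameService ID.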

5.1.2.6 Delete ZooKeeper Failover Controllers

Prerequisites:

   If the following command, run on the Ambari server host, returns a non-empty items array, you must delete the ZooKeeper (ZK) Failover Controllers:

   curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/       \

   <CLUSTER_NAME>/host_components?HostRoles/component_name=ZKFC

To delete the failover controllers:

   ①   On the Ambari server host, issue the following DELETE commands:

   curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X DELETE <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/       \

   clusters/<CLUSTER_NAME>/hosts/<NAMENODE_HOSTNAME>/host_components/ZKFC

   curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X DELETE <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/       \

   clusters/<CLUSTER_NAME>/hosts/<ADDITIONAL_NAMENODE_HOSTNAME>/host_components/ZKFC

   ②   Verify that the controllers have been removed:

   curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/   \

   <CLUSTER_NAME>/host_components?HostRoles/component_name=ZKFC

   This command should return an empty items array.
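For readers who prefer to script these calls, the curl commands above can be mirrored with Python's standard library. The sketch below only builds the request objects (basic authentication plus the X-Requested-By header, as in the curl commands) and does not send anything; the helper names are hypothetical, and the base URL is assumed to have the form <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>.

```python
import base64
import urllib.request


def ambari_request(url, user, password, method="GET"):
    """Build a request equivalent to: curl -u user:password -H "X-Requested-By: ambari" -X method url."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        method=method,
        headers={
            "Authorization": f"Basic {token}",
            "X-Requested-By": "ambari",
        },
    )


def zkfc_query_url(base):
    """URL that lists all ZKFC host components in the cluster."""
    return f"{base}/host_components?HostRoles/component_name=ZKFC"


def zkfc_delete_url(base, host):
    """URL of the ZKFC host component on a single host."""
    return f"{base}/hosts/{host}/host_components/ZKFC"
```

A request built this way could be sent with urllib.request.urlopen; keeping construction separate from sending makes it easy to print and review each DELETE target before anything is removed.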

5.1.2.7 Modify HDFS Configurations

You might need to modify the hdfs-site configuration, the core-site configuration, or both.

Prerequisites:

   Check whether you need to modify the hdfs-site configuration by running the following command on the Ambari server host:

   /var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> get localhost    \

   <CLUSTER_NAME> hdfs-site

If you see any of the following properties, you must delete them from the configuration:

   • dfs.nameservices

   • dfs.client.failover.proxy.provider.<NAMESERVICE_ID>

   • dfs.ha.namenodes.<NAMESERVICE_ID>

   • dfs.ha.fencing.methods

   • dfs.ha.automatic-failover.enabled

   • dfs.namenode.http-address.<NAMESERVICE_ID>.nn1

   • dfs.namenode.http-address.<NAMESERVICE_ID>.nn2

   • dfs.namenode.rpc-address.<NAMESERVICE_ID>.nn1

   • dfs.namenode.rpc-address.<NAMESERVICE_ID>.nn2

   • dfs.namenode.shared.edits.dir

   • dfs.journalnode.edits.dir

   • dfs.journalnode.http-address

   • dfs.journalnode.kerberos.internal.spnego.principal

   • dfs.journalnode.kerberos.principal

   • dfs.journalnode.keytab.file

Here, <NAMESERVICE_ID> is the NameService ID you created when you ran the Enable NameNode HA wizard.
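Checking the list by eye is error-prone because six of the property names embed the NameService ID. A minimal sketch of the check, assuming the hdfs-site properties have already been loaded into a plain dict (for example from the output of the get command above, whose exact output format is not covered here) and using a made-up function name:

```python
def ha_properties_to_delete(hdfs_site, nameservice_id):
    """Return the HA-related property names from the list above that are present in hdfs_site."""
    fixed_names = [
        "dfs.nameservices",
        "dfs.ha.fencing.methods",
        "dfs.ha.automatic-failover.enabled",
        "dfs.namenode.shared.edits.dir",
        "dfs.journalnode.edits.dir",
        "dfs.journalnode.http-address",
        "dfs.journalnode.kerberos.internal.spnego.principal",
        "dfs.journalnode.kerberos.principal",
        "dfs.journalnode.keytab.file",
    ]
    # Names that embed the NameService ID chosen in the Enable NameNode HA wizard.
    per_nameservice = [
        f"dfs.client.failover.proxy.provider.{nameservice_id}",
        f"dfs.ha.namenodes.{nameservice_id}",
        f"dfs.namenode.http-address.{nameservice_id}.nn1",
        f"dfs.namenode.http-address.{nameservice_id}.nn2",
        f"dfs.namenode.rpc-address.{nameservice_id}.nn1",
        f"dfs.namenode.rpc-address.{nameservice_id}.nn2",
    ]
    return [name for name in fixed_names + per_nameservice if name in hdfs_site]
```

The returned names are the values to substitute for property_name in the delete command, one invocation per name; an empty list means hdfs-site needs no changes.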

To modify the hdfs-site configuration:

   ①   On the Ambari Server host, run the following command for each property you found:

       /var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> delete    \

       localhost <CLUSTER_NAME> hdfs-site property_name

       Replace property_name with the name of each property to be deleted.

   ②   Verify that all of the properties have been deleted:

       /var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> get localhost    \

       <CLUSTER_NAME> hdfs-site

   ③   Determine whether you must modify the core-site configuration:

       /var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> get localhost    \

       <CLUSTER_NAME> core-site

   ④   If you see the ha.zookeeper.quorum property, delete it:

       /var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> delete    \

       localhost <CLUSTER_NAME> core-site ha.zookeeper.quorum

   ⑤   If fs.defaultFS is set to the NameService ID, revert it to its non-HA value.

       For example, "fs.defaultFS":"hdfs://<name-service-id>" must be modified because it points to a NameService ID,

       whereas "fs.defaultFS":"hdfs://<nn01.mycompany.com>" points to a specific NameNode host and does not need to change.

   ⑥   Revert the fs.defaultFS property to the NameNode host value:

       /var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> set localhost    \

       <CLUSTER_NAME> core-site fs.defaultFS hdfs://<NAMENODE_HOSTNAME>

   ⑦   Verify that the core-site properties are now set correctly:

       /var/lib/ambari-server/resources/scripts/configs.py -u <AMBARI_USER> -p <AMBARI_PW> -port <AMBARI_PORT> get localhost    \

       <CLUSTER_NAME> core-site

   The fs.defaultFS property should be set to the NameNode host, and the ha.zookeeper.quorum property should not appear.

5.1.2.8 Re-create the Secondary NameNode

You might need to re-create the Secondary NameNode.

Prerequisites:

   On the Ambari Server host, check whether you need to re-create the Secondary NameNode:

   curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X GET    \

   <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/host_components?HostRoles/component_name=SECONDARY_NAMENODE

   If this returns an empty items array, you must re-create the Secondary NameNode.

To re-create the Secondary NameNode:

   ①   On the Ambari Server host, run the following command:

   curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X POST    \

   -d '{"host_components" : [{"HostRoles":{"component_name":"SECONDARY_NAMENODE"}}] }'    \

   <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/hosts?Hosts/host_name=<SECONDARY_NAMENODE_HOSTNAME>

   ②   Verify that the Secondary NameNode now exists. On the Ambari server host, run the following command:

   curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X GET    \

   <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/host_components?HostRoles/component_name=SECONDARY_NAMENODE

   The command should return a non-empty items array containing the Secondary NameNode.

5.1.2.9 Re-enable the Secondary NameNode

   ①   On the Ambari Server host, run the following command:

       curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X PUT    \

       -d '{"RequestInfo":{"context":"Enable Secondary NameNode"},"Body":{"HostRoles":{"state":"INSTALLED"}}}'    \

       <AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/hosts/<SECONDARY_NAMENODE_HOSTNAME>/host_components/SECONDARY_NAMENODE

   ②   Analyze the output:

       • If it returns 200, proceed to delete all JournalNodes.

       • If it returns 202, wait a few minutes and then run the following command:

       curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X GET    \

       "<AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/host_components?HostRoles/component_name=SECONDARY_NAMENODE&fields=HostRoles/state"

       Wait for the response "state" : "INSTALLED" before proceeding.
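The decision in step ② can be expressed as a small helper. The sketch below maps the HTTP status code to the next action and checks a response body for the INSTALLED state. It assumes the JSON body has the shape {"items": [{"HostRoles": {"state": ...}}]}, which is inferred from the fields requested in the command above rather than confirmed by this guide, and the function names are illustrative.

```python
import json


def next_step(status_code):
    """Map the HTTP status of the PUT request to the action described in step 2."""
    if status_code == 200:
        return "proceed"       # change applied; go on to delete the JournalNodes
    if status_code == 202:
        return "poll"          # request accepted; poll the component state
    return "investigate"       # any other status is not covered by the procedure


def component_states(response_body):
    """Extract HostRoles/state from each item of a host_components response (assumed shape)."""
    payload = json.loads(response_body)
    return [item["HostRoles"]["state"] for item in payload.get("items", [])]


def is_installed(response_body):
    """True when at least one component is listed and every listed one is INSTALLED."""
    states = component_states(response_body)
    return bool(states) and all(state == "INSTALLED" for state in states)
```

An empty items array is deliberately treated as not installed, since it would mean the Secondary NameNode was never re-created.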

5.1.2.10 Delete All JournalNodes

You may need to delete one or more JournalNodes.

Prerequisites:

   On the Ambari Server host, check whether any JournalNodes need to be deleted:

   curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X GET "<AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/host_components?HostRoles/component_name=JOURNALNODE"

   If the command returns an empty items array, you can continue. Otherwise, you must delete the JournalNodes.

To delete the JournalNodes:

   ①   On the Ambari Server host, run the following commands, one for each JournalNode host:

       curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X DELETE "<AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/hosts/<JOURNALNODE1_HOSTNAME>/host_components/JOURNALNODE"

       curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X DELETE "<AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/hosts/<JOURNALNODE2_HOSTNAME>/host_components/JOURNALNODE"

       curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X DELETE "<AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/hosts/<JOURNALNODE3_HOSTNAME>/host_components/JOURNALNODE"

   ②   Verify that all JournalNodes have been deleted. On the Ambari Server host, run:

       curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X GET "<AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/host_components?HostRoles/component_name=JOURNALNODE"

       This command should return an empty items array.
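Because the DELETE call differs only in the host name, the three commands can be generated from a host list. A minimal sketch, assuming the same placeholders as the commands above; `delete_commands` and the example host names are hypothetical, and the function only builds command strings without running them.

```python
def delete_commands(hosts, component="JOURNALNODE"):
    """Return one curl DELETE command string per host for the given component."""
    template = ('curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X DELETE '
                '"<AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>'
                '/hosts/{host}/host_components/{component}"')
    return [template.format(host=host, component=component) for host in hosts]

# Example host names are placeholders, not taken from a real cluster.
for command in delete_commands(["jn1.example.com", "jn2.example.com", "jn3.example.com"]):
    print(command)
```

Printing the commands first, instead of executing them, leaves room to review each host name before anything is deleted.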

5.1.2.11 Delete the Additional NameNode

You may need to delete the additional NameNode.

Prerequisites:

   On the Ambari Server host, check whether the additional NameNode needs to be deleted:

   curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X GET "<AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/host_components?HostRoles/component_name=NAMENODE"

   If the returned items array contains two NameNodes, you must delete the additional NameNode.

To delete the additional NameNode that was set up for HA:

   ①   On the Ambari Server host, run the following command:

       curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X DELETE "<AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/hosts/<ADDITIONAL_NAMENODE_HOSTNAME>/host_components/NAMENODE"

   ②   Verify that the additional NameNode has been deleted:

       curl -u <AMBARI_USER>:<AMBARI_PW> -H "X-Requested-By: ambari" -i -X GET "<AMBARI_PROTO>://localhost:<AMBARI_PORT>/api/v1/clusters/<CLUSTER_NAME>/host_components?HostRoles/component_name=NAMENODE"

       The returned items array should contain exactly one NameNode.

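5.1.2.12 Verify the HDFS Components

Before starting HDFS, verify that you have the correct components.

   ①   Browse to Ambari Web UI > Services, and then select HDFS.

   ②   Check the Summary panel and make sure that the first three lines look like this:

       • NameNode

       • SNameNode

       • DataNodes

       You should not see a line for JournalNodes.

5.1.2.13 Start HDFS

   ①   In the Ambari Web UI, click Service Actions, and then click Start.

   ②   If the progress bar does not show that the service has fully started and passed the service checks, repeat step ①.

   ③   To start all other services, click Actions > Start All on the Services page.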

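5.1.3 Managing JournalNodes

After you enable NameNode high availability in your cluster, you must maintain at least three active JournalNodes in the cluster. You can use the Manage JournalNode wizard to assign, add, or remove JournalNodes. The wizard assigns JournalNodes, reviews and confirms the required configuration changes, and then restarts all components in the cluster so that they pick up the JournalNode and configuration changes.

Note that this wizard restarts all cluster services.

Prerequisites:

   NameNode high availability must be enabled in the cluster.

To manage JournalNodes in the cluster:

   (1) In Ambari Web, select Services > HDFS > Summary.

   (2) Click Service Actions, and then click Manage JournalNodes.

   (3) On the Assign JournalNodes page, assign JournalNodes by using the + and - icons and selecting host names from the drop-down menu. When you finish assigning hosts, click Next.

   (4) On the Review page, verify the JournalNode host assignments and the related configuration changes. When you are satisfied, click Next.

   (5) Using a remote shell, complete the steps on the Save Namespace page. After a checkpoint has been created successfully, click Next.

   (6) On the Add/Remove JournalNodes page, monitor the progress bars, and then click Next.

   (7) Follow the instructions on the Manual Steps Required: Format JournalNodes page, and then click Next.

   (8) In the remote shell, confirm that you want to initialize the JournalNodes by entering Y at the following prompt:

       Re-format filesystem in QJM to [host.ip.address.1, host.ip.address.2, host.ip.address.3,] ? (Y or N) Y

   (9) On the Start Active NameNodes page, monitor the progress bars as services restart, and then click Next.

   (10) On the Manual Steps Required: Bootstrap Standby NameNode page, complete each step using the instructions on the page, and then click Next.

   (11) In the remote shell, confirm that you want to bootstrap the standby NameNode by entering Y at the following prompt:

       Re-format filesystem in Storage Directory /grid/0/hadoop/hdfs/namenode ? (Y or N) Y

   (12) On the Start All Services page, monitor the progress bars as the wizard starts all services, and then click Done to finish the wizard.

       After the Ambari Web UI reloads, you may see some alert notifications. Wait a few minutes until all services have restarted and the alerts clear.

   (13) If necessary, restart any components using the Ambari Web UI.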

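5.2 ResourceManager High Availability

If you are working in an HDP 2.2 or later environment, you can configure high availability for ResourceManager by using the Enable ResourceManager HA wizard.

Prerequisites:

   ● The cluster must have at least three hosts.

   ● At least three ZooKeeper servers must be running.

5.2.1 Configure ResourceManager High Availability

To access the wizard and configure ResourceManager high availability:

   ①   In Ambari Web, browse to Services > YARN > Summary.

   ②   Select Service Actions, and then select Enable ResourceManager HA.

       The Enable ResourceManager HA wizard launches and describes the automated and manual steps that are required to set up ResourceManager high availability.

   ③   On the Get Started page, read the overview of enabling ResourceManager HA, and then click Next to continue.

   ④   On the Select Host page, accept the default selection or choose an available host, and then click Next to continue.

   ⑤   On the Review Selections page, expand YARN if necessary and review all recommended changes to the YARN configuration. Click Next to approve the changes and start automatically configuring ResourceManager HA.

   ⑥   On the Configure Components page, click Complete when all progress bars have finished.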

5.2.2 Disable ResourceManager High Availability

To disable ResourceManager high availability, you must delete one ResourceManager and keep one ResourceManager. This requires using the Ambari API to modify the cluster configuration and delete the ResourceManager, and using the ZooKeeper client to update the znode permissions.

Before you begin:

   Because these steps use the Ambari REST API, test and verify them in a test environment before running them in production.

To disable ResourceManager high availability:

   (1) In Ambari Web, stop the YARN and ZooKeeper services.

   (2) On the Ambari Server host, use the Ambari API to retrieve the YARN configuration into a JSON file:

       /var/lib/ambari-server/resources/scripts/configs.py get <ambari.server> <cluster.name> yarn-site yarn-site.json

       In this example, ambari.server is the host name of the Ambari Server and cluster.name is the name of the cluster.

   (3) In the yarn-site.json file, change yarn.resourcemanager.ha.enabled to false, and delete the following properties:

       • yarn.resourcemanager.ha.rm-ids

       • yarn.resourcemanager.hostname.rm1

       • yarn.resourcemanager.hostname.rm2

       • yarn.resourcemanager.webapp.address.rm1

       • yarn.resourcemanager.webapp.address.rm2

       • yarn.resourcemanager.webapp.https.address.rm1

       • yarn.resourcemanager.webapp.https.address.rm2

       • yarn.resourcemanager.cluster-id

       • yarn.resourcemanager.ha.automatic-failover.zk-base-path

   (4) Verify that the following properties in the yarn-site.json file are set to the host name of the ResourceManager that you are keeping:

       • yarn.resourcemanager.hostname

       • yarn.resourcemanager.admin.address

       • yarn.resourcemanager.webapp.address

       • yarn.resourcemanager.resource-tracker.address

       • yarn.resourcemanager.scheduler.address

       • yarn.resourcemanager.webapp.https.address

       • yarn.timeline-service.webapp.address

       • yarn.timeline-service.webapp.https.address

       • yarn.timeline-service.address

       • yarn.log.server.url

   (5) Search the yarn-site.json file and remove any references to the host name of the ResourceManager that you are deleting.

   (6) Search the yarn-site.json file and remove any properties that are still keyed by a ResourceManager ID, such as rm1 and rm2.

   (7) Save the yarn-site.json file and upload it to the Ambari Server:

       /var/lib/ambari-server/resources/scripts/configs.py set <ambari.server> <cluster.name> yarn-site yarn-site.json

   (8) Using the Ambari API, delete the ResourceManager host component that you are removing. Here <hostname> is the host of that ResourceManager:

       curl --user admin:admin -i -H "X-Requested-By: ambari" -X DELETE http://<ambari.server>:8080/api/v1/clusters/<cluster.name>/hosts/<hostname>/host_components/RESOURCEMANAGER

   (9) In Ambari Web, start the ZooKeeper service.

   (10) On a host that has the ZooKeeper client installed, use the ZooKeeper client to change the znode permissions:

       /usr/hdp/current/zookeeper-client/bin/zkCli.sh

       getAcl /rmstore/ZKRMStateRoot

       setAcl /rmstore/ZKRMStateRoot world:anyone:rwcda

   (11) In Ambari Web, restart the ZooKeeper service and start the YARN service.
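Steps (3) and (6) amount to a mechanical edit of the property map. The sketch below shows that edit on a plain dictionary of yarn-site properties. It assumes the properties are available as a flat name-to-value mapping; the exact layout of the file written by configs.py may differ and should be checked before adapting this. The function name `disable_rm_ha` is hypothetical.

```python
# Properties that exist only for HA and have no rm-id suffix (from step 3).
HA_ONLY = {
    "yarn.resourcemanager.ha.rm-ids",
    "yarn.resourcemanager.cluster-id",
    "yarn.resourcemanager.ha.automatic-failover.zk-base-path",
}

def disable_rm_ha(properties, rm_ids=("rm1", "rm2")):
    """Return a copy of the yarn-site properties with the HA settings removed."""
    cleaned = {}
    for name, value in properties.items():
        if name in HA_ONLY:
            continue
        # Drop every property keyed by a ResourceManager ID, e.g. "...hostname.rm1".
        if any(name.endswith("." + rm_id) for rm_id in rm_ids):
            continue
        cleaned[name] = value
    cleaned["yarn.resourcemanager.ha.enabled"] = "false"
    return cleaned

example = {
    "yarn.resourcemanager.ha.enabled": "true",
    "yarn.resourcemanager.ha.rm-ids": "rm1,rm2",
    "yarn.resourcemanager.hostname.rm1": "rm-a.example.com",
    "yarn.resourcemanager.hostname.rm2": "rm-b.example.com",
    "yarn.resourcemanager.hostname": "rm-a.example.com",
}
print(disable_rm_ha(example))
```

The sketch does not cover steps (4) and (5): values that still name the deleted ResourceManager host have to be reviewed by hand.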

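5.3 HBase High Availability

To help you achieve redundancy for high availability in a production environment, Apache HBase supports deploying multiple Masters in a cluster. If you are working in a Hortonworks Data Platform (HDP) 2.2 or later environment, Apache Ambari lets you set up multiple HBase Masters with simple configuration.

During the Apache HBase service installation, and depending on your component assignment, Ambari installs and configures one HBase Master component and multiple RegionServer components. To configure high availability for the HBase service, you can run two or more HBase Master components. HBase uses ZooKeeper to coordinate which of the HBase Masters running in the cluster is the active Master. This means that when the primary HBase Master fails, clients are automatically routed to the secondary Master.

   ● Set Up Multiple HBase Masters Through Ambari

   Hortonworks recommends that you use Ambari to configure multiple HBase Masters. Complete the following tasks:

   ● Add a Secondary HBase Master to a New Cluster

   When installing HBase, click the + icon displayed to the right of the selected HBase Master to add and select a node on which to deploy a second HBase Master.

   ● Add a New HBase Master to an Existing Cluster

   ①   Log in to the Ambari management UI as a cluster administrator.

   ②   In Ambari Web, browse to Services > HBase.

   ③   In Service Actions, click + Add HBase Master.

   ④   Choose the host on which to install the additional HBase Master, and then click Confirm Add.

   Ambari installs the new HBase Master and reconfigures HBase to manage multiple Master instances.

   ● Set Up Multiple HBase Masters Manually

   Before you configure multiple HBase Masters manually, you must configure the first node (node-1) in the cluster by following the installation instructions. Then complete the following tasks:

   ①   Configure passwordless SSH access

   ②   Prepare node-1

   ③   Prepare node-2 and node-3

   ④   Start and test the HBase cluster

   ● Configure Passwordless SSH Access

   The first node in the cluster (node-1) must be able to log in to the other hosts in the cluster and then back to itself in order to start the daemons. You can accomplish this by using the same user name on all hosts and by using passwordless SSH login.

   ①   On node-1, stop the HBase service.

   ②   On node-1, log in as the HBase user and generate an SSH key pair:

       $ ssh-keygen -t rsa

       The system prints the location of the key pair. The default public key file is id_rsa.pub.

   ③   Create a directory on the other nodes to hold the public key:

       On node-2, log in as the HBase user and create a .ssh/ directory in the user's home directory.

       On node-3, repeat the procedure.

   ④   Use scp or another standard secure tool to copy the public key from node-1 to the other two nodes.

       On each node, create a new file .ssh/authorized_keys and append the contents of id_rsa.pub to it:

       $ cat id_rsa.pub >> ~/.ssh/authorized_keys

       Make sure that you append to (>>) the .ssh/authorized_keys file rather than overwrite it.

   ⑤   From node-1, use SSH to log in to the other nodes with the same user name. You should not be prompted for a password.

   ⑥   On node-2, repeat step ⑤, because it runs as a backup Master.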

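   ● Prepare node-1

   Because node-1 is to run the primary Master and a ZooKeeper process, you must stop the RegionServer from starting on node-1.

   ①   Edit conf/regionservers: remove the line that contains localhost, and add lines with the host names or IP addresses of node-2 and node-3.

       Note:

           If you want to run a RegionServer on node-1, refer to it by the host name that the other servers use to communicate with it. For example, for node-1, use node-1.test.com.

   ②   Configure HBase to use node-2 as a backup Master by creating a new file in conf/ called backup-masters and adding a line to it with the host name of node-2, for example node-2.test.com.

   ③   On node-1, configure ZooKeeper by editing conf/hbase-site.xml and adding the following properties:

       <property>

           <name>hbase.zookeeper.quorum</name>

           <value>node-1.test.com,node-2.test.com,node-3.test.com</value>

       </property>

       <property>

           <name>hbase.zookeeper.property.dataDir</name>

           <value>/usr/local/zookeeper</value>

       </property>

       This configuration directs HBase to start and manage a ZooKeeper instance on each node of the cluster.

   ④   Everywhere in your configuration that refers to node-1 as localhost, change the reference to the host name, for example node-1.test.com.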
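The quorum property in step ③ is a comma-separated list of the ZooKeeper hosts. As an illustration only, the following sketch generates such a property element from a host list with Python's standard library. The helper names are hypothetical, and the output is a single-line element that would still need to be placed inside the existing configuration element of hbase-site.xml.

```python
import xml.etree.ElementTree as ET

def quorum_value(hosts):
    """Join ZooKeeper host names into the comma-separated quorum value."""
    return ",".join(hosts)

def property_xml(name, value):
    """Render one Hadoop-style <property> element as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")

hosts = ["node-1.test.com", "node-2.test.com", "node-3.test.com"]
print(property_xml("hbase.zookeeper.quorum", quorum_value(hosts)))
print(property_xml("hbase.zookeeper.property.dataDir", "/usr/local/zookeeper"))
```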

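   ● Prepare node-2 and node-3

   Before preparing node-2 and node-3, note that every node in the cluster must have the same configuration information.

   node-2 runs a backup Master server and a ZooKeeper instance.

   ①   Download and unpack HBase on node-2 and node-3.

   ②   Copy the configuration files from node-1 to node-2 and node-3.

   ③   Copy the contents of the conf/ directory to the conf/ directory on node-2 and node-3.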

   ● Start and Test Your HBase Cluster

   ①   Use the jps command to make sure that HBase is not running.

   ②   Kill the HMaster, HRegionServer, and HQuorumPeer processes if they are running.

   ③   Start the cluster by running start-hbase.sh on node-1. The output should be similar to the following:

       $ bin/start-hbase.sh

       node-3.test.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-3.test.com.out

       node-1.test.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-1.test.com.out

       node-2.test.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-2.test.com.out

       starting master, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-master-node-1.test.com.out

       node-3.test.com: starting regionserver, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-regionserver-node-3.test.com.out

       node-2.test.com: starting regionserver, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-regionserver-node-2.test.com.out

       node-2.test.com: starting master, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-master-node2.test.com.out

       ZooKeeper starts first, followed by the Master, then the RegionServers, and finally the backup Masters.

   ④   Run the jps command on each node to verify that the correct processes are running on each server.

       You might see additional Java processes running on your servers as well, if they are used for other purposes.

       Example 1. node-1 jps Output

       $ jps

       20355 Jps

       20071 HQuorumPeer

       20137 HMaster

       Example 2. node-2 jps Output

       $ jps

       15930 HRegionServer

       16194 Jps

       15838 HQuorumPeer

       16010 HMaster

       Example 3. node-3 jps Output

       $ jps

       13901 Jps

       13639 HQuorumPeer

       13737 HRegionServer

       Note (ZooKeeper process name):

           The HQuorumPeer process is a ZooKeeper instance that is controlled and started by HBase. If you use ZooKeeper this way, it is limited to one instance per cluster node and is appropriate for testing only. If ZooKeeper runs outside of HBase, the process is called QuorumPeer.
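The jps comparison in step ④ can be automated. This sketch parses jps output and reports which expected HBase processes are missing on each node. The expected sets are taken from the three example outputs above; the function names are hypothetical, and collecting the jps output from each host is left out.

```python
# Expected HBase-related processes per node, taken from the example outputs.
EXPECTED = {
    "node-1": {"HQuorumPeer", "HMaster"},
    "node-2": {"HQuorumPeer", "HMaster", "HRegionServer"},
    "node-3": {"HQuorumPeer", "HRegionServer"},
}

def running_processes(jps_output):
    """Return the set of process names listed by jps, ignoring jps itself."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit() and parts[1] != "Jps":
            names.add(parts[1])
    return names

def missing_processes(node, jps_output):
    """Return the expected processes that do not appear in the jps output."""
    return EXPECTED[node] - running_processes(jps_output)

node1_output = "20355 Jps\n20071 HQuorumPeer\n20137 HMaster\n"
print(missing_processes("node-1", node1_output))
```

Extra Java processes are ignored on purpose, since the text notes that servers may run unrelated JVMs.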

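   ⑤   Browse to the web UI and test your new connections.

       You should be able to connect to the UI for the Master at http://node-1.test.com:16010/

       or to the secondary Master at http://node-2.test.com:16010/

       You can see the web UI for each RegionServer on port 16030.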

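5.4 Hive High Availability

The Apache Hive service has multiple associated components. The primary Hive components are Hive Metastore and HiveServer2. In HDP 2.2 or later, you can configure high availability for the Hive service by running two or more of each of those components.

5.4.1 Adding a Hive Metastore Component

Before you begin:

   If ACID is enabled in Hive, make sure that the Run Compactor setting is enabled (set to True) on only one Hive Metastore host.

Steps:

   ①   In Ambari Web, browse to Services > Hive.

   ②   In Service Actions, click the + Add Hive Metastore option.

   ③   Choose the host on which to install the additional Hive Metastore, and then click Confirm Add.

   ④   Ambari installs the component and reconfigures Hive to handle multiple Hive Metastore instances.

5.4.2 Adding a HiveServer2 Component

Steps:

   ①   In Ambari Web, browse to the host on which you want to install another HiveServer2 component.

   ②   On the Host page, click +Add.

   ③   Click HiveServer2 in the list.

   Ambari installs the additional HiveServer2.

5.4.3 Adding a WebHCat Server

Steps:

   ①   In Ambari Web, browse to the host on which you want to install another WebHCat server.

   ②   On the Host page, click +Add.

   ③   Click WebHCat in the list.

   Ambari installs the new server and reconfigures Hive.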

5.5 Storm High Availability

In HDP 2.3 or later, you can configure high availability for the Apache Storm Nimbus server by adding a Nimbus component in Ambari.

5.5.1 Adding a Nimbus Component

Steps:

   ①   In Ambari Web, browse to Services > Storm.

   ②   In Service Actions, click the + Add Nimbus option.

   ③   Click the host on which to install the additional Nimbus, and then click Confirm Add.

   Ambari installs the component and reconfigures Storm to handle multiple Nimbus instances.

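5.6 Oozie High Availability

In HDP 2.2 or later, you can set up high availability for the Apache Oozie service by running two or more Oozie Server components.

Before you begin:

   ● The default Derby database instance that is installed with Oozie does not support multiple Oozie Server instances, so you must use an existing relational database. When Apache Derby provides the database for the Oozie Server, you do not have the option to add Oozie Server components to the cluster.

   ● High availability for Oozie requires an external virtual IP address or a load balancer to direct traffic to the multiple Oozie servers.

5.6.1 Adding an Oozie Server Component

Steps:

   (1) In Ambari Web, browse to the host on which you want to install another Oozie server.

   (2) On the Host page, click the +Add button.

   (3) Click Oozie server in the list.

   (4) Configure your external load balancer, and then update the Oozie configuration.

   (5) Browse to Services > Oozie > Configs.

   (6) In oozie-site, add the following property values:

       oozie.zookeeper.connection.string

       List the ZooKeeper hosts with ports, for example:

       c6401.ambari.apache.org:2181,

       c6402.ambari.apache.org:2181,

       c6403.ambari.apache.org:2181

       oozie.services.ext

       org.apache.oozie.service.ZKLocksService,

       org.apache.oozie.service.ZKXLogStreamingService,

       org.apache.oozie.service.ZKJobsConcurrencyService

       oozie.base.url

       http://<loadbalancer.hostname>:11000/oozie

   (7) In oozie-env, uncomment the oozie_base_url property and change its value to point to the load balancer:

       export oozie_base_url="http://<loadbalancer.hostname>:11000/oozie"

   (8) Restart Oozie.

   (9) Update the HDFS configuration properties for the Oozie proxy user:

       a. Browse to Services > HDFS > Configs.

       b. In core-site, update the hadoop.proxyuser.oozie.hosts property to include the newly added Oozie server host. Separate multiple host names with commas.

   (10) Restart the services.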
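The three oozie-site values in step (6) follow a fixed pattern, so they can be derived from the ZooKeeper host list and the load balancer host name. A minimal sketch, using the default ports shown above (2181 and 11000); the function name and the example load balancer host are hypothetical, and the result is only a dictionary to copy from, not something applied to Ambari.

```python
def oozie_ha_properties(zk_hosts, lb_host, zk_port=2181, oozie_port=11000):
    """Return the oozie-site property values needed for Oozie HA as a dict."""
    services = [
        "org.apache.oozie.service.ZKLocksService",
        "org.apache.oozie.service.ZKXLogStreamingService",
        "org.apache.oozie.service.ZKJobsConcurrencyService",
    ]
    return {
        "oozie.zookeeper.connection.string": ",".join(f"{h}:{zk_port}" for h in zk_hosts),
        "oozie.services.ext": ",".join(services),
        "oozie.base.url": f"http://{lb_host}:{oozie_port}/oozie",
    }

hosts = ["c6401.ambari.apache.org", "c6402.ambari.apache.org", "c6403.ambari.apache.org"]
for name, value in oozie_ha_properties(hosts, "lb.example.com").items():
    print(name, "=", value)
```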

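5.7 Apache Atlas High Availability

Steps:

   (1) On the Ambari dashboard, click Hosts, and then select the host on which to install the standby Atlas Metadata Server.

   (2) On the Summary page of the new Atlas Metadata Server host, click Add > Atlas Metadata Server.

       Ambari adds the new Atlas Metadata Server in a Stopped state.

   (3) Click Atlas > Configs > Advanced.

   (4) Click Advanced application-properties and append the new Atlas Metadata Server to the atlas.rest.address property, separated by a comma, in the following form:

       ,http(s)://<host_name>:<port_number>

       The default protocol is "http". If the atlas.enableTLS property is set to true, use "https". The default HTTP port is 21000 and the default HTTPS port is 21443.

       These values can be overridden with the atlas.server.http.port and atlas.server.https.port properties, respectively.

   (5) Stop all Atlas Metadata Servers that are currently running.

       Important:

           You must use the Stop command to stop the Atlas Metadata Servers. Do not use the Restart command: it attempts to first stop the newly added Atlas Server, which at this point does not yet have any configuration in /etc/atlas/conf.

   (6) On the Ambari dashboard, click Atlas > Service Actions > Start.

       Ambari automatically configures the following Atlas properties in the /etc/atlas/conf/atlas-application.properties file:

       • atlas.server.ids

       • atlas.server.address.$id

       • atlas.server.ha.enabled

   (7) To refresh the configuration files, restart the following services that contain Atlas hooks:

       • Hive

       • Storm

       • Falcon

       • Sqoop

       • Oozie

   (8) Click Actions > Restart All Required to restart all services that require a restart.

       When you update the Atlas configuration settings in Ambari, Ambari marks the services that require a restart.

   (9) Click Oozie > Service Actions > Restart All to restart Oozie along with its associated services.

       Apache Oozie requires a restart after an Atlas configuration update, but it might not be included in the services that Ambari marks as requiring a restart.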
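The value of atlas.rest.address in step (4) is a comma-separated list of server URLs whose scheme and default port depend on whether TLS is enabled. The sketch below builds that value from the defaults stated above (http on 21000, https on 21443). The function name and host names are hypothetical.

```python
def atlas_rest_address(hosts, tls=False, port=None):
    """Build the comma-separated atlas.rest.address value for the given hosts."""
    scheme = "https" if tls else "http"
    if port is None:
        # Defaults from the text; overridable through atlas.server.http(s).port.
        port = 21443 if tls else 21000
    return ",".join(f"{scheme}://{host}:{port}" for host in hosts)

print(atlas_rest_address(["atlas1.example.com", "atlas2.example.com"]))
print(atlas_rest_address(["atlas1.example.com"], tls=True))
```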

5.8 Enabling Ranger Admin High Availability

In an Ambari-managed cluster, you can configure Ranger Admin high availability with or without SSL.

Steps:

   ● HTTPD setup for HTTP - enable Ranger Admin HA using Ambari, beginning at step 16:

   https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_hadoop-high-availability/content/configure_ranger_admin_ha.html#configure_ranger_admin_ha_without_ssl

   ● HTTPD setup for HTTPS - enable Ranger Admin HA using Ambari, beginning at step 14:

   https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_hadoop-high-availability/content/configure_ranger_admin_ha.html#configure_ranger_admin_ha_with_ssl

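6 Managing Configurations

You can optimize the performance of the Hadoop components in your cluster by adjusting configuration settings and property values. You can also use Ambari Web to set up and manage groups and versions of configuration settings in the following ways:

   • Changing Configuration Settings

   • Manage Host Config Groups

   • Configuring Log Settings

   • Set Service Configuration Versions

   • Download Client Configuration Files

6.1 Changing Configuration Settings

You can optimize service performance on each service's Configs page. The Configs page includes several tabs for managing configuration versions, groups, settings, properties, and values. You can adjust settings called "Smart Configs" that control, at a macro level, the memory allocation for each service. Adjusting Smart Configs requires related configuration settings to change throughout the cluster. Ambari prompts you to review and confirm all recommended changes and to restart the affected services.

Steps:

   ①   In Ambari Web, click a service name in the service list on the left.

   ②   From the service Summary page, click the Configs tab, and then use the following tabs to manage configuration settings:

       Use the Configs tab to manage configuration versions and groups.

       Use the Settings tab to manage "Smart Configs" by adjusting the green slider buttons.

       Use the Advanced tab to edit specific configuration properties and values.

   ③   Click Save.

6.1.1 Adjust Smart Config Settings

Use the Settings tab to manage "Smart Configs" by adjusting the green slider buttons.

Steps:

   ① On the Settings tab, click and drag a green slider button to the desired value.

   ② Edit any property that displays the Override option.

   ③ Click Save.

6.1.2 Edit Specific Properties

Use the Advanced tab of each service's Configs page to access groups of properties that affect the performance of that service.

Steps:

   ① On the service's Configs page, click Advanced.

   ② On the Configs Advanced page, expand a category.

   ③ Edit the property value.

   ④ Click Save.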

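6.1.3 Review and Confirm Configuration Changes

When you change a configuration property value, the Ambari Stack Advisor captures the change and recommends changes to all related configuration properties that are affected by it. Changing a property or a "Smart Configuration", and other actions such as adding or deleting a service, host, or ZooKeeper server, moving a master, or enabling high availability for a component, all require that you review and confirm the related configuration changes. For example, if you increase the Minimum Container Size (Memory) for YARN, Dependent Configurations lists all recommended changes, which you must review and (optionally) accept.

Types of changes are highlighted in the following colors:

   Value changes        : yellow

   Added properties     : green

   Deleted properties   : red

To review and confirm changes to configuration properties:

Steps:

   ①   In Dependent Configurations, review the summary information for each listed property.

   ②   If the change is acceptable, proceed to review the next property in the list.

   ③   If the change is not acceptable, click the blue check mark in front of the property.

       Clicking the check mark clears the check box. Changes whose check box is cleared are not confirmed and are not applied.

   ④   After reviewing all listed changes, click OK to confirm that all marked changes are applied.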
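The three change types that Ambari highlights correspond to an ordinary dictionary diff. The following sketch classifies the differences between two property maps in the same way. It illustrates the categories only and is not Ambari's Stack Advisor logic; the property names in the example are used purely as sample keys.

```python
def classify_changes(old, new):
    """Label each differing property as "changed", "added", or "removed"."""
    changes = {}
    for name in old.keys() | new.keys():
        if name not in old:
            changes[name] = "added"      # shown in green
        elif name not in new:
            changes[name] = "removed"    # shown in red
        elif old[name] != new[name]:
            changes[name] = "changed"    # shown in yellow
    return changes

before = {"yarn.scheduler.minimum-allocation-mb": "1024", "obsolete.property": "x"}
after = {"yarn.scheduler.minimum-allocation-mb": "2048", "new.property": "y"}
print(classify_changes(before, after))
```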

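6.1.4 Restart Components

After you edit and save configuration changes, a Restart indicator appears next to the components that must be restarted to use the updated configuration values.

   ①   Click the indicated Components or Hosts link to view details about the requested restart.

   ②   Click Restart, and then click the appropriate action.

6.2 Manage Host Config Groups

Ambari initially assigns all hosts in the cluster to one default configuration group for each service that you install. For example, after you deploy a three-node cluster with default configuration settings, each host belongs to one configuration group that has the default configuration settings for the HDFS service.

   ● To manage configuration groups:

   ①   Click a service name, and then click Configs.

   ②   On the Configs page, click Manage Config Groups.

   ● To create a new configuration group, reassign hosts, and override the default settings of host components, use the Manage Configuration Groups control:

   ①   In Manage Config Groups, click the + button for Create New Configuration Group.

   ②   Enter a name and description for the configuration group, and then click OK.

   ● To add hosts to the new configuration group:

   ①   In Manage Config Groups, click the name of the configuration group.

   ②   Click the + button for Add Hosts to selected Configuration Group.

   ③   In Select Configuration Group Hosts, click Components, and then click a component name in the list.

       Choosing a component filters the host list so that only hosts on which the selected service component exists are listed. To filter the list of available host names further, use the Filter drop-down list. By default, the host list is filtered by IP address.

   ④   After filtering the host list, select the check box of each host that you want to include in the configuration group.

   ⑤   Click OK.

   ⑥   In Manage Configuration Groups, click Save.

   ● To edit the settings of a configuration group:

   ①   In Configs, click a group name.

   ②   Click a Config Group, and then expand components to find the settings that allow Override.

   ③   Provide a non-default value, and then click Override or Save.

       Configuration groups enforce the configuration properties that allow override, based on the components installed for the selected service and group.

   ④   Override prompts you to choose one of the following options:

       a. Click the name of an existing configuration group, to which the overriding value provided in step ③ is applied.

       b. Or create a new configuration group that contains the default values plus the value overridden in step ③.

       c. Click OK.

   ⑤   Click Save.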

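6.3 Configuring Log Settings

Ambari uses sets of Log4j properties to control logging activity for each service running in the Hadoop cluster. Initially, the default value of each property is in the <service_name>-log4j template file. The Log4j properties and values that limit the size of log files and the number of backup log files for each service appear above the log4j template file. To access the default Log4j settings for a service, in Ambari Web, browse to <Service_name> > Configs > Advanced <service_name>-log4j.

   ● To change the log file size and the number of backups for a service:

   ①    Edit the values of the <service_name> backup file size and <service_name> # of backup files properties.

   ②    Click Save.

   ● To customize the Log4j settings for a service:

   ①   Edit the properties in the <service_name> log4j template.

   ②   Copy the contents of the log4j template file.

   ③   Browse to the custom <service_name>log4j properties group.

   ④   Paste the copied contents into custom <service_name>log4j properties, overwriting the default contents.

   ⑤   Click Save.

   ⑥   When prompted, review and confirm the suggested configuration changes.

   ⑦   If prompted, restart the affected services.

   Restarting the components in the service pushes the configuration properties displayed in Custom log4j.properties to each host that runs components of that service.

   If you have customized logging properties that define how the activities of each service are logged, refresh indicators appear next to each service name. Make sure that the logging properties displayed in Custom log4j.properties include your customizations.

   Optionally, you can create configuration groups that include custom logging properties.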

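6.4 Set Service Configuration Versions

Ambari lets you manage configurations associated with a service. You can change configurations, view the change history, compare and revert changes, and push configuration changes to the cluster hosts.

6.4.1 Basic Concepts

It is important to understand how service configurations are organized and stored in Ambari. Properties are grouped into configuration types. A set of configuration types makes up the configuration of a service.

For example, the Hadoop Distributed File System (HDFS) service includes the hdfs-site, core-site, hdfs-log4j, hadoop-env, and hadoop-policy configuration types. If you browse to Services > HDFS > Configs, you can edit the configuration properties of these configuration types.

Ambari performs configuration versioning at the service level. Therefore, when you change a configuration property in a service, Ambari creates a new service configuration version.

6.4.2 Terminology

Configuration property                   : A property managed by Ambari, such as NameNode heap size or replication factor

Configuration type (config type)         : A group of configuration properties, for example hdfs-site

Service configurations                   : The set of configuration types for a particular service, for example hdfs-site and core-site as part of the HDFS service configuration

Change notes                             : Optional notes saved with a service configuration change

Service config version (SCV)             : A particular version of the configuration of a specific service

Host config group (HCG)                  : A set of configuration properties that applies to a specific set of hosts

6.4.3 Saving a Change

   ①   In Configs, change the value of a configuration property.

   ②   Choose Save.

   ③   Optionally, enter notes that describe the change.

   ④   Click Cancel to continue editing, click Discard to leave the control without making any changes, or click Save to confirm the change.

6.4.4 Viewing History

In Ambari Web, you can view the history of configuration changes in two places: on the Config History tab of the Dashboard page, and on the Configs tab of each service page.

The Dashboard > Config History tab shows a table of all versions across all services, with each version's number and the date and time it was created. You can also see which user changed the configuration, along with the notes about the change. You can filter, sort, and search the versions in this table.

The Service > Configs tab shows only the most recent configuration change, although you can use the version scrollbar to see earlier versions. This tab gives you quick access to the most recent changes to a service configuration.

In this view, you can click any version in the scrollbar to view it, and you can hover the mouse pointer over a version to display an option menu for comparing versions and reverting changes, which lets you make any version the current version.

6.4.5 Comparing Versions

When browsing the version scrollbar on the Services > Configs tab, you can hover the mouse pointer over a version to display the view, compare, and revert (make current) options.

To compare two service configuration versions:

   ①   Navigate to a specific configuration version, for example V6.

   ②   Using the version scrollbar, find the version that you want to compare with V6, for example V2.

   ③   Hover the mouse pointer over V2 to display the option menu, and then click Compare.

Ambari displays a comparison of V6 and V2, along with an option to revert to V2 (Make V2 Current). Ambari also filters the display by Changed properties, under the Filter control.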

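6.4.6 Reverting a Change

You can revert to an older service configuration version by using the Make Current feature. Make Current creates a new service configuration version from the version that you choose to revert to. In effect, it is a clone of that version.

After you initiate the Make Current operation, enter notes at the Make Current Confirmation prompt and save (Make Current).

There are several ways to revert to a previous configuration version:

   ● View a specific version, and then click Make V* Current.

   ● Use the version navigation, and then click Make Current.

   ● Hover the mouse pointer over a version in the version scrollbar, and then click Make Current.

   ● Perform a version comparison, and then click Make V* Current.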
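The point that Make Current clones an old version, rather than moving a pointer back, can be pictured with a small model. This is a conceptual sketch only: the list-of-dictionaries history and the `make_current` function are hypothetical and do not reflect how Ambari stores versions internally.

```python
def make_current(history, version, note=""):
    """Return a new history with a clone of `version` appended as the newest version."""
    source = next(v for v in history if v["version"] == version)
    clone = {
        "version": max(v["version"] for v in history) + 1,
        "properties": dict(source["properties"]),  # copy, so later edits stay separate
        "note": note,
    }
    return history + [clone]

history = [
    {"version": 1, "properties": {"dfs.replication": "3"}, "note": "initial"},
    {"version": 2, "properties": {"dfs.replication": "2"}, "note": "experiment"},
]
history = make_current(history, 1, note="revert to V1")
print(history[-1])
```

In this model, reverting to V1 produces V3 with V1's properties, and V2 remains in the history.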

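6.4.7 Host Config Groups

Service configuration versions are scoped to a host config group. For example, changes made in the default group can be compared and reverted within that group. The same applies to custom groups.

6.5 Download Client Configuration Files

Client configuration files include the .xml files, env-sh scripts, and log4j properties that are used to configure Hadoop services. For services that include client components (most services, except SmartSense and Ambari Metrics), you can download the client configuration files associated with that service. You can also download the client configuration files for the entire cluster as one archive.

● To download client configuration files for a single service:

Steps:

   ①   In Ambari Web, browse to the service whose client configuration you want.

   ②   Click Service Actions.

   ③   Click Download Client Configs.

   Your browser downloads a "tarball" archive that contains only the client configuration files for the selected service to the browser's default local download directory.

   ④   If you are prompted to save or open the client configuration files, click Save File, and then click OK.

● To download all client configuration files for the entire cluster:

   ①   In Ambari Web, click Actions at the bottom of the service list.

   ②   Click Download Client Configs.

   Your browser downloads a "tarball" archive that contains all client configuration files for the cluster to the browser's default local download directory.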

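7 Administering the Cluster

Using the Ambari Web Admin options:

Any user                 : can view information about the stack and the version of each service added to it

Cluster administrators   : can

                                   • enable Kerberos security

                                   • regenerate keytabs

                                   • view the names and values of service user accounts

                                   • enable auto-start for services

Ambari administrators    : can

                                   • add new services to the stack

                                   • upgrade the stack to a new version

7.1 Using Stack and Versions Information

The Stack tab lists the services that are installed and available in the cluster stack. Any user can browse the list of services. As an Ambari administrator, you can click Add Service to start the wizard that installs a service into the cluster.

The Versions tab shows which version of the software is currently installed and running in the cluster. As a cluster administrator, you can initiate an automated cluster upgrade from this page.

7.2 Viewing Service Accounts

As a cluster administrator, you can view the list of service user and group accounts for the cluster services.

In Ambari Web UI > Admin, click Service Accounts.

7.3 Enabling Kerberos and Regenerating Keytabs

As a cluster administrator, you can enable and manage Kerberos security in the cluster.

Before you begin:

   Before enabling Kerberos in the cluster, you must prepare the cluster as described in:

   https://docs.hortonworks.com/HDPDocuments/Ambari-2.6.1.5/bk_ambari-security/content/ch_configuring_amb_hdp_for_kerberos.html

Steps:

   In the Ambari Web UI > Admin menu, click Enable Kerberos to launch the Kerberos wizard.

After Kerberos is enabled, you can regenerate keytabs and disable Kerberos from the Ambari Web UI > Admin menu.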

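7.3.1 Regenerate Keytabs

As a cluster administrator, you can regenerate the keytabs that are required to maintain Kerberos security.

Before you begin:

   Before regenerating keytabs:

   ● The cluster must be Kerberos-enabled.

   ● You must have the KDC Admin credentials.

Steps:

   ①   Browse to Admin > Kerberos.

   ②   Click Regenerate Kerberos.

   ③   Confirm your selection.

   ④   Ambari connects to the Kerberos Key Distribution Center (KDC) and regenerates the keytabs for the service and Ambari principals in the cluster. Optionally, you can regenerate keytabs only for hosts that are missing keytabs, for example hosts that were offline or unavailable when Kerberos was enabled in Ambari.

   ⑤   Restart all services.

7.3.2 Disable Kerberos

As a cluster administrator, you can disable Kerberos in the cluster.

Before you begin:

   Before disabling Kerberos security, the cluster must already be Kerberos-enabled.

Steps:

   ①   Browse to Admin > Kerberos.

   ②   Click Disable Kerberos.

   ③   Confirm your selection.

       The cluster services stop, and the Ambari Kerberos security settings are reset.

   ④ To re-enable Kerberos, click Enable Kerberos and follow the wizard.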







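7.4 Enable Service Auto-Start

As a cluster administrator or cluster operator, you can enable each service in the stack to restart automatically. Enabling auto-start for a service causes the ambari-agent to attempt to restart service components that are in a stopped state, without manual action by a user. Auto-Start Services is enabled by default, but only the Ambari Metrics Collector component is set to auto-start by default.

As a first step, you should enable auto-start for the worker nodes of the core Hadoop services, for example the DataNode and NameNode components of YARN and HDFS. You should also enable auto-start for all components of the SmartSense service. After enabling auto-start, monitor the operating status of your services on the Ambari Web dashboard. Auto-start attempts do not appear as background operations. To diagnose failures to start a service component, check the ambari-agent log file, located at /var/log/ambari-agent.log on the component host.

To manage the auto-start status of the components of a service:

Steps:

   ①   In Auto-Start Services, click a service name.

   ②   For at least one component in the Auto-Start Services control, click the grey area to change its status to Enabled.

       The green icon to the right of the service name indicates the percentage of that service's components that have auto-start enabled.

   ③   To enable auto-start for all components of the service, click Enable All.

       A completely filled green icon indicates that all components of the service have auto-start enabled.

   ④   To disable auto-start for all components of the service, click Disable All.

       An empty green icon indicates that all components of the service have auto-start disabled.

   ⑤   To clear all pending status changes before saving them, click Discard.

   ⑥   When you finish changing the auto-start status settings, click Save.

To disable Auto-Start Services:

   ①   In Ambari Web, click Admin > Service Auto-Start.

   ②   In Service Auto Start Configuration, click the grey area in the Auto-Start Services control to change its status from Enabled to Disabled.

   ③   Click Save.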

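8 Managing Alerts and Notifications

Ambari uses a predefined set of seven types of alerts (web, port, metric, aggregate, script, server, and recovery) for each cluster component and host. You can use these alerts to monitor cluster health and to notify other users so that they can help identify and troubleshoot problems. You can change an alert's name, description, and check interval, and you can disable and re-enable alerts.

You can also create groups of alerts and set up notification targets for each group, so that different sets of alerts are sent to different groups of users by different methods.

8.1 Understanding Alerts

Ambari predefines a set of alerts that monitor the cluster components and hosts. Each alert is defined by an alert definition, which specifies the alert type, the check interval, and the thresholds. When a cluster is created or modified, Ambari reads the alert definitions and creates alert instances for the specific items to be monitored. For example, if the cluster includes the Hadoop Distributed File System (HDFS), there is an alert definition that monitors "DataNode Process". An instance of that alert definition is created for each DataNode in the cluster.

In Ambari Web, you can browse the list of alert definitions for the cluster by clicking the Alerts tab. You can search and filter alert definitions by current status, by last status change, and by the service with which the alert definition is associated. You can click an alert definition name to view details about that alert, to modify its properties (such as check interval and thresholds), and to see the list of alert instances associated with that definition.

Each alert instance reports an alert status, defined by severity. The most common severity levels are OK, WARNING, and CRITICAL, but there are also the UNKNOWN and NONE levels. Alert notifications are sent when the alert status changes (for example, when the status changes from OK to CRITICAL).

8.1.1 Alert Types

Alert thresholds and the units of those thresholds depend on the type of the alert. The following list describes the alert types, their possible statuses, and, where thresholds are configurable, the units in which they can be configured.

WEB Alert Type           : WEB alerts watch a web URL on a given component. The alert status is determined by the HTTP response code, so you cannot change which response codes determine the thresholds for WEB alerts. You can customize the response text for each threshold and the overall web connection timeout. A connection timeout is considered a CRITICAL alert. Threshold units are in seconds.

                        The response codes correspond to WEB alert statuses as follows:

                            ● OK status         : if the web URL responds with a code below 400.

                            ● WARNING status    : if the web URL responds with a code equal to or higher than 400.

                            ● CRITICAL status   : if Ambari cannot connect to the web URL.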
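The mapping from HTTP response code to WEB alert status can be written out directly. A minimal sketch of the rule as stated above, using `None` to stand for a failed or timed-out connection; the function name is hypothetical and this is not Ambari's own implementation.

```python
def web_alert_status(response_code):
    """Map an HTTP response code (or None for no connection) to a WEB alert status."""
    if response_code is None:
        return "CRITICAL"   # Ambari could not connect, or the connection timed out
    if response_code < 400:
        return "OK"
    return "WARNING"        # the URL answered, but with code 400 or higher

for code in (200, 302, 404, 503, None):
    print(code, web_alert_status(code))
```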



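PORT Alert Type          : PORT alerts check the response time for connecting to a given port. Threshold units are in seconds.

METRIC Alert Type        : METRIC alerts check the value of a single metric or of several metrics (if a calculation is performed). The metrics are accessed from a URL endpoint available on a given component. A connection timeout is considered a CRITICAL alert.

                        The thresholds are adjustable, and the units of each threshold depend on the metric. For example, for CPU utilization alerts the unit is percent; for RPC latency alerts the unit is milliseconds.

AGGREGATE Alert Type     : AGGREGATE alerts aggregate the alert statuses as a percentage of the alert instances affected. For example, the Percent DataNode Process alert aggregates the DataNode Process alerts.

SCRIPT Alert Type        : SCRIPT alerts execute a script that determines the status, such as OK, WARNING, or CRITICAL. You can customize the response text, the property values, and the thresholds for SCRIPT alerts.

SERVER Alert Type        : SERVER alerts execute a server-side runnable class that determines the alert status, such as OK, WARNING, or CRITICAL.

RECOVERY Alert Type      : RECOVERY alerts are handled by the Ambari Agent and monitor process restarts. The OK, WARNING, and CRITICAL statuses are based on the number of times a process has been restarted automatically. This is useful for knowing when processes are terminating and Ambari is automatically restarting them.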

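8.2 Modifying Alerts

General properties of an alert include its name, description, check interval, and thresholds.

The check interval defines how often Ambari checks the alert status. For example, "1 minute" means that Ambari checks the alert status every minute.

The configuration options for thresholds depend on the alert type.

To modify the general properties of an alert:

   ①   Browse to the Alerts section in Ambari Web.

   ②   Find the alert definition and click it to view the definition details.

   ③   Click Edit to modify the name, description, check interval, and thresholds (if applicable).

   ④   Click Save.

   ⑤   The changes take effect on all alert instances at the next check interval.

8.3 Modifying Alert Check Counts

Ambari lets you set the number of checks that an alert performs before a notification is dispatched. If the alert status changes during a check, Ambari rechecks the condition a specified number of times (the check count) before dispatching a notification.

Alert check counts do not apply to the AGGREGATE alert type. For AGGREGATE alerts, a single status change results in a notification being dispatched.

If your environment frequently experiences transient issues that cause false alerts, you can increase the check count. In this case, the alert status change is still recorded, but as a SOFT status change. If the alert condition is still triggered after the specified number of checks, the status change is considered HARD, and notifications are sent.

You generally set the check count globally for all alerts, but you can also override the global value for individual alerts if one or more specific alerts experience transient issues.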
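The SOFT and HARD behaviour can be illustrated with a small simulation. The sketch below assumes one plausible reading of the rule: a new status must be observed in `check_count` consecutive checks before it becomes HARD and a notification is sent, and a return to the confirmed status resets the count. Ambari's actual bookkeeping may differ in detail, and the function name is hypothetical.

```python
def notifications(states, check_count):
    """Return (check_index, status) pairs at which a HARD change would notify.

    states[0] is taken as the initially confirmed status.
    """
    confirmed = states[0]
    candidate, streak = None, 0
    sent = []
    for index, status in enumerate(states[1:], start=1):
        if status == confirmed:
            candidate, streak = None, 0        # transient issue cleared (SOFT only)
            continue
        if status == candidate:
            streak += 1
        else:
            candidate, streak = status, 1      # first sighting: SOFT change
        if streak >= check_count:
            confirmed, candidate, streak = status, None, 0
            sent.append((index, status))       # HARD change: notify
    return sent

checks = ["OK", "CRITICAL", "OK", "CRITICAL", "CRITICAL", "CRITICAL"]
print(notifications(checks, check_count=3))
print(notifications(checks, check_count=1))
```

With a check count of 3, the single CRITICAL blip at the second check is absorbed, and only the sustained failure at the end produces a notification.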

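To modify the global alert check count:

   ① Browse to the Alerts section in Ambari Web.

   ② In the Actions menu, click Manage Alert Settings.

   ③ Update the Check Count value.

   ④ Click Save.

   Changes to the global alert check count might take a few seconds to appear in the Ambari UI for individual alerts.

To override the global alert check count for an individual alert:

   ① Browse to the Alerts section in Ambari Web.

   ② Select the alert for which you want to set a specific Check Count value.

   ③ On the right, click the Edit icon next to the Check Count property.

   ④ Update the Check Count value.

   ⑤ Click Save.

8.4 Disabling and Re-enabling Alerts

You can disable alerts. When an alert is disabled, no alert instances are in effect and Ambari no longer performs the checks for that alert. Therefore, no alert status changes are recorded and no notifications are sent.

   ① Browse to the Alerts section in Ambari Web.

   ② Find the alert definition, and then click the Enabled or Disabled text next to it to enable or disable the alert.

   ③ Alternatively, click the alert to view the definition details, and then click Enabled or Disabled to enable or disable the alert.

   ④ Confirm the change when prompted.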

8.5 Tables of Predefined Alerts

8.5.1 HDFS Service Alerts

□ Alert name: NameNode Blocks Health

   Alert type         : METRIC

   Description        : This service-level alert is triggered if the number of corrupt or missing blocks exceeds the configured critical threshold.

   Potential causes   : Some DataNodes are down and the replicas that are missing blocks are only on those DataNodes.

                The corrupt or missing blocks are from files with a replication factor of 1. New replicas cannot be created because the only replica of the block is missing.

   Possible remedies  : For critical data, use a replication factor of 3.

                Bring up the failed DataNodes with missing or corrupt blocks.

                Identify the files associated with the missing or corrupt blocks by running the Hadoop fsck command.

                Delete the corrupt files and recover them from backup, if one exists.

□ Alert name: NFS Gateway Process

   Alert type         : PORT

   Description        : This host-level alert is triggered if the NFS Gateway process cannot be confirmed as active.

   Potential causes   : NFS Gateway is down.

   Possible remedies  : Check for a non-operating NFS Gateway in Ambari Web.

□ Alert name: DataNode Storage

   Alert type         : METRIC

   Description        : This host-level alert is triggered if storage capacity is full on the DataNode (90% critical). It checks the DataNode JMX Servlet for the Capacity and Remaining properties.

   Potential causes   : Cluster storage is full.

                If cluster storage is not full, DataNode is full.

   Possible remedies  : If the cluster still has storage, use the load balancer to distribute the data to relatively less-used DataNodes.

                If the cluster is full, delete unnecessary data or add additional storage by adding either more DataNodes or more or larger disks to the DataNodes. After adding more storage, run the load balancer.

□ 警报名称：DataNode Process

   警报类型   ：PORT

   描述       ：This host-level alert is triggered if the individual DataNode processes cannot be established to be up and listening on

               the network for the configured critical threshold, in seconds.

   潜在原因   ：DataNode process is down or not responding.

               DataNode is not down but is not listening to the correct network port/address.

   解决方法   ：Check for non-operating DataNodes in Ambari Web.

               Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode, if necessary.

               Run the netstat -tuplpn command to check if the DataNode process is bound to the correct network port.
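
As a rough illustration of what a PORT-type check does, the following sketch tries to open a TCP connection and reports whether something is listening. It is not Ambari's own implementation. The function name and the default timeout are hypothetical.

```python
import socket


def port_is_listening(host, port, timeout=1.5):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, port_is_listening("dn1.example.com", 50010) would probe a DataNode, where both the host name and the port number are assumptions; use the port configured for your cluster.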

□ 警报名称：DataNode Web UI

   警报类型   ：WEB

   描述       ：This host-level alert is triggered if the DataNode web UI is unreachable.

   潜在原因   ：The DataNode process is not running.

   解决方法   ：Check whether the DataNode process is running.

□ 警报名称：NameNode Host CPU Utilization

   警报类型   ：METRIC

   描述       ：This host-level alert is triggered if CPU utilization of the NameNode exceeds certain thresholds (200% warning,

               250% critical). It checks the NameNode JMX Servlet for the SystemCPULoad property. This information is available only if

               you are running JDK 1.7.

   潜在原因   ：Unusually high CPU utilization might be caused by a very unusual job or query workload, but this is generally the sign

               of an issue in the daemon.

   解决方法   ：Use the top command to determine which processes are consuming excess CPU.

               Reset the offending process.

□ 警报名称：NameNode Web UI

   警报类型   ：WEB

   描述       ：This host-level alert is triggered if the NameNode web UI is unreachable.

   潜在原因   ：The NameNode process is not running.

   解决方法   ：Check whether the NameNode process is running.

□ 警报名称：Percent DataNodes with Available Space

   警报类型   ：AGGREGATE

   描述       ：This service-level alert is triggered if the storage is full on a certain percentage of DataNodes (10% warn, 30% critical).

   潜在原因   ：Cluster storage is full.

               If cluster storage is not full, DataNode is full.

   解决方法   ：If the cluster still has storage, use the load balancer to distribute the data to relatively less-used DataNodes.

               If the cluster is full, delete unnecessary data or increase storage by adding either more DataNodes or more or larger disks

               to the DataNodes. After adding more storage, run the load balancer.
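
An AGGREGATE alert summarizes the per-host results of another alert as a percentage. The sketch below shows that idea with the 10% and 30% thresholds quoted above. The function name, the input shape, and the choice of >= for the comparison are assumptions, not Ambari's actual code.

```python
def aggregate_state(host_states, warn_pct=10.0, crit_pct=30.0):
    """host_states maps a host name to True when that host's alert is firing.

    Returns (state, percent_of_hosts_firing).
    """
    if not host_states:
        return "UNKNOWN", 0.0
    firing = sum(1 for is_firing in host_states.values() if is_firing)
    pct = 100.0 * firing / len(host_states)
    if pct >= crit_pct:
        return "CRITICAL", pct
    if pct >= warn_pct:
        return "WARNING", pct
    return "OK", pct


# Ten hypothetical DataNodes, one of which reports full storage.
example = {f"dn{i}": (i == 0) for i in range(10)}
state, pct = aggregate_state(example)
```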

□ 警报名称：Percent DataNodes Available

   警报类型   ：AGGREGATE

   描述       ：This alert is triggered if the number of non-operating DataNodes in the cluster is greater than the configured critical

               threshold. It aggregates the results of DataNode process alert checks.

   潜在原因   ：DataNodes are down.

               DataNodes are not down but are not listening to the correct network port/address.

   解决方法   ：Check for non-operating DataNodes in Ambari Web.

               Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode hosts/processes.

               Run the netstat -tuplpn command to check if the DataNode process is bound to the correct network port.

□ 警报名称：NameNode RPC Latency

   警报类型   ：METRIC

   描述       ：This host-level alert is triggered if the NameNode operations RPC latency exceeds the configured critical threshold.

               Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to

               increase for NameNode operations.

   潜在原因   ：A job or an application is performing too many NameNode operations.

   解决方法   ：Review the job or the application for potential bugs causing it to perform too many NameNode operations.

□ 警报名称：NameNode Last Checkpoint

   警报类型   ：SCRIPT

   描述       ：This alert will trigger if the last time that the NameNode performed a checkpoint was too long ago or if the number of

               uncommitted transactions is beyond a certain threshold.

   潜在原因   ：Too much time elapsed since last NameNode checkpoint.

               Uncommitted transactions beyond threshold.

   解决方法   ：Set NameNode checkpoint.

               Review threshold for uncommitted transactions.

□ 警报名称：Secondary NameNode Process

   警报类型   ：WEB

   描述       ：This alert is triggered if the Secondary NameNode process cannot be confirmed to be up and listening on the network. This alert is not applicable

               when NameNode HA is configured.

   潜在原因   ：The Secondary NameNode is not running.

   解决方法   ：Check that the Secondary NameNode process is running.

□ 警报名称：NameNode Directory Status

   警报类型   ：METRIC

   描述       ：This alert checks if the NameNode NameDirStatus metric reports a failed directory.

   潜在原因   ：One or more of the directories are reporting as not healthy.

   解决方法   ：Check the NameNode UI for information about unhealthy directories.

□ 警报名称：HDFS Capacity Utilization

   警报类型   ：METRIC

   描述       ：This service-level alert is triggered if the HDFS capacity utilization exceeds the configured critical threshold

               (80% warn, 90% critical). It checks the NameNode JMX Servlet for the CapacityUsed and CapacityRemaining properties.

   潜在原因   ：Cluster storage is full.

   解决方法   ：Delete unnecessary data.

               Archive unused data.

               Add more DataNodes.

               Add more or larger disks to the DataNodes.

               After adding more storage, run the load balancer.
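
To make the threshold check concrete, the sketch below parses a NameNode JMX response and computes utilization from the CapacityUsed and CapacityRemaining properties named above. The sample JSON is invented for illustration, and the formula used / (used + remaining) is an assumption about how the percentage is derived.

```python
import json

# Invented sample in the general shape of a /jmx response; values are not real.
SAMPLE_JMX = """
{"beans": [{"name": "Hadoop:service=NameNode,name=FSNamesystem",
            "CapacityUsed": 850, "CapacityRemaining": 150}]}
"""


def hdfs_capacity_state(jmx_text, warn_pct=80.0, crit_pct=90.0):
    """Return (state, percent_used) from a JMX JSON document."""
    beans = json.loads(jmx_text)["beans"]
    # Pick the first bean that exposes both capacity properties.
    bean = next(b for b in beans
                if "CapacityUsed" in b and "CapacityRemaining" in b)
    used = bean["CapacityUsed"]
    total = used + bean["CapacityRemaining"]
    pct = 100.0 * used / total if total else 0.0
    if pct >= crit_pct:
        return "CRITICAL", pct
    if pct >= warn_pct:
        return "WARNING", pct
    return "OK", pct


state, pct = hdfs_capacity_state(SAMPLE_JMX)
```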

□ 警报名称: DataNode Health Summary

   警报类型   : METRIC

   描述       : This service-level alert is triggered if there are unhealthy DataNodes.

   潜在原因   : A DataNode is in an unhealthy state.

   解决方法   : Check the NameNode UI for the list of non-operating DataNodes.

□ 警报名称：HDFS Pending Deletion Blocks

   警报类型   : METRIC

   描述       : This service-level alert is triggered if the number of blocks pending deletion in HDFS exceeds the configured warning

               and critical thresholds. It checks the NameNode JMX Servlet for the PendingDeletionBlock property.

   潜在原因   : Large number of blocks are pending deletion.

   解决方法   :

□ 警报名称：HDFS Upgrade Finalized State

   警报类型   : SCRIPT

   描述       : This service-level alert is triggered if HDFS is not in the finalized state.

   潜在原因   : The HDFS upgrade is not finalized.

   解决方法   : Finalize any upgrade you have in process.

□ 警报名称：DataNode Unmounted Data Dir

   警报类型   : SCRIPT

   描述       : This host-level alert is triggered if one of the data directories on a host was previously on a mount point and became

               unmounted.

   潜在原因   : If the mount history file does not exist, then report an error if a host has one or more mounted data directories as well

               as one or more unmounted data directories on the root partition. This may indicate that a data directory is writing to the

               root partition, which is undesirable.

   解决方法   : Check the data directories to confirm they are mounted as expected.

□ 警报名称：DataNode Heap Usage

   警报类型   : METRIC

   描述       : This host-level alert is triggered if heap usage goes past thresholds on the DataNode. It checks the DataNode JMX Servlet

               for the MemHeapUsedM and MemHeapMaxM properties. The threshold values are percentages.

   潜在原因   :

□ 警报名称：NameNode Client RPC Queue Latency

   警报类型   : SCRIPT

   描述       : This service-level alert is triggered if the deviation of RPC queue latency on client port has grown beyond the specified

               threshold within a given period. This alert will monitor Hourly and Daily periods.

   潜在原因   :

   解决方法   :

□ 警报名称：NameNode Client RPC Processing Latency

   警报类型   : SCRIPT

   描述       : This service-level alert is triggered if the deviation of RPC processing latency on client port has grown beyond the specified

               threshold within a given period. This alert will monitor Hourly and Daily periods.

   潜在原因   :

   解决方法   :

□ 警报名称：NameNode Service RPC Queue Latency

   警报类型   : SCRIPT

   描述       : This service-level alert is triggered if the deviation of RPC queue latency on the DataNode port has grown beyond the specified

               threshold within a given period. This alert will monitor Hourly and Daily periods.

   潜在原因   :

   解决方法   :

□ 警报名称：NameNode Service RPC Processing Latency

   警报类型   : SCRIPT

   描述       : This service-level alert is triggered if the deviation of RPC processing latency on the DataNode port has grown beyond the specified

               threshold within a given period. This alert will monitor Hourly and Daily periods.

   潜在原因   :

   解决方法   :

□ 警报名称：HDFS Storage Capacity Usage

   警报类型   : SCRIPT

   描述       : This service-level alert is triggered if the increase in storage capacity usage deviation has grown beyond the specified

               threshold within a given period. This alert will monitor Daily and Weekly periods.

   潜在原因   :

   解决方法   :

□ 警报名称：NameNode Heap Usage

   警报类型   : SCRIPT

   描述       : This service-level alert is triggered if the NameNode heap usage deviation has grown beyond the specified threshold

               within a given period. This alert will monitor Daily and Weekly periods.

   潜在原因   :

   解决方法   :

8.5.2 HDFS HA 警报 (HDFS HA Alerts)

□ 警报名称: JournalNode Web UI

   警报类型   : WEB

   描述       : This host-level alert is triggered if the individual JournalNode process cannot be established to be up and listening

               on the network for the configured critical threshold, given in seconds.

   潜在原因   : The JournalNode process is down or not responding.

               The JournalNode is not down but is not listening to the correct network port/address.

   解决方法   :

□ 警报名称: NameNode High Availability Health

   警报类型   : SCRIPT

   描述       : This service-level alert is triggered if either the Active NameNode or Standby NameNode is not running.

   潜在原因   : The Active, Standby or both NameNode processes are down.

   解决方法   : On each host running NameNode, check for any errors in the logs (/var/log/hadoop/hdfs/) and restart the NameNode

               host/process using Ambari Web.

               On each host running NameNode, run the netstat -tuplpn command to check if the NameNode process is bound to the correct

               network port.

□ 警报名称: Percent JournalNodes Available

   警报类型   : AGGREGATE

   描述       : This service-level alert is triggered if the number of down JournalNodes in the cluster is greater than the configured

               critical threshold (33% warn, 50% critical). It aggregates the results of JournalNode process checks.

   潜在原因   : JournalNodes are down.

               JournalNodes are not down but are not listening to the correct network port/address.

   解决方法   : Check for dead JournalNodes in Ambari Web.

□ 警报名称: ZooKeeper Failover Controller Process

   警报类型   : PORT

   描述       : This alert is triggered if the ZooKeeper Failover Controller process cannot be confirmed to be up and listening on the

               network.

   潜在原因   : The ZKFC process is down or not responding.

   解决方法   : Check if the ZKFC process is running.

8.5.3 NameNode HA 警报 (NameNode HA Alerts)

□ 警报名称: JournalNode Process

   警报类型   : WEB

   描述       : This host-level alert is triggered if the individual JournalNode process cannot be established to be up and listening

               on the network for the configured critical threshold, given in seconds.

   潜在原因   : The JournalNode process is down or not responding.

               The JournalNode is not down but is not listening to the correct network port/address.

   解决方法   : Check if the JournalNode process is running.

□ 警报名称: NameNode High Availability Health

   警报类型   : SCRIPT

   描述       : This service-level alert is triggered if either the Active NameNode or Standby NameNode is not running.

   潜在原因   : The Active, Standby or both NameNode processes are down.

   解决方法   : On each host running NameNode, check for any errors in the logs (/var/log/hadoop/hdfs/) and restart the NameNode

               host/process using Ambari Web.

               On each host running NameNode, run the netstat -tuplpn command to check if the NameNode process is bound to the correct

               network port.

□ 警报名称: Percent JournalNodes Available

   警报类型   : AGGREGATE

   描述       : This service-level alert is triggered if the number of down JournalNodes in the cluster is greater than the configured

               critical threshold (33% warn, 50% critical). It aggregates the results of JournalNode process checks.

   潜在原因   : JournalNodes are down.

               JournalNodes are not down but are not listening to the correct network port/address.

   解决方法   : Check for non-operating JournalNodes in Ambari Web.

□ 警报名称: ZooKeeper Failover Controller Process

   警报类型   : PORT

   描述       : This alert is triggered if the ZooKeeper Failover Controller process cannot be confirmed to be up and listening on the

               network.

   潜在原因   : The ZKFC process is down or not responding.

   解决方法   : Check if the ZKFC process is running.

8.5.4 YARN 警报 (YARN Alerts)

□ 警报名称: App Timeline Web UI

   警报类型   : WEB

   描述       : This host-level alert is triggered if the App Timeline Server Web UI is unreachable.

   潜在原因   : The App Timeline Server is down.

               App Timeline Service is not down but is not listening to the correct network port/address.

   解决方法   : Check for non-operating App Timeline Server in Ambari Web.

□ 警报名称: Percent NodeManagers Available

   警报类型   : AGGREGATE

   描述       : This alert is triggered if the number of down NodeManagers in the cluster is greater than the configured critical threshold.

               It aggregates the results of NodeManager process alert checks.

   潜在原因   : NodeManagers are down.

               NodeManagers are not down but are not listening to the correct network port/address.

   解决方法   : Check for non-operating NodeManagers.

               Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart the NodeManager hosts/processes, as necessary.

               Run the netstat -tuplpn command to check if the NodeManager process is bound to the correct network port.

□ 警报名称: ResourceManager Web UI

   警报类型   : WEB

   描述       : This host-level alert is triggered if the ResourceManager Web UI is unreachable.

   潜在原因   : The ResourceManager process is not running.

   解决方法   : Check if the ResourceManager process is running.

□ 警报名称: ResourceManager RPC Latency

   警报类型   : METRIC

   描述       : This host-level alert is triggered if the ResourceManager operations RPC latency exceeds the configured critical threshold.

               Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to

               increase for ResourceManager operations.


   潜在原因   : A job or an application is performing too many ResourceManager operations.

   解决方法   : Review the job or the application for potential bugs causing it to perform too many ResourceManager operations.

□ 警报名称: ResourceManager CPU Utilization

   警报类型   : METRIC

   描述       : This host-level alert is triggered if CPU utilization of the ResourceManager exceeds certain thresholds (200% warning,

               250% critical). It checks the ResourceManager JMX Servlet for the SystemCPULoad property. This information is only available

               if you are running JDK 1.7.

   潜在原因   : Unusually high CPU utilization: Can be caused by a very unusual job/query workload, but this is generally the sign of

               an issue in the daemon.

   解决方法   : Use the top command to determine which processes are consuming excess CPU.

               Reset the offending process.

□ 警报名称: NodeManager Web UI

   警报类型   : WEB

   描述       : This host-level alert is triggered if the NodeManager process cannot be established to be up and listening on the network

               for the configured critical threshold, given in seconds.

   潜在原因   : NodeManager process is down or not responding.

               NodeManager is not down but is not listening to the correct network port/address.

   解决方法   : Check if the NodeManager is running.

               Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart the NodeManager, if necessary.

□ 警报名称: NodeManager Health Summary

   警报类型   : SCRIPT

   描述       : This host-level alert checks the node health property available from the NodeManager component.

   潜在原因   : NodeManager Health Check script reports issues or is not configured.

   解决方法   : Check in the NodeManager logs (/var/log/hadoop/yarn) for health check errors and restart the NodeManager if necessary.

               Check in the ResourceManager UI logs (/var/log/hadoop/yarn) for health check errors.

□ 警报名称: NodeManager Health

   警报类型   : SCRIPT

   描述       : This host-level alert checks the nodeHealthy property available from the NodeManager component.

   潜在原因   : The NodeManager process is down or not responding.

   解决方法   : Check in the NodeManager logs (/var/log/hadoop/yarn) for health check errors and restart the NodeManager if necessary.

8.5.5 MapReduce2 警报 (MapReduce2 Alerts)

□ 警报名称: History Server Web UI

   警报类型   : WEB

   描述       : This host-level alert is triggered if the HistoryServer Web UI is unreachable.

   潜在原因   : The HistoryServer process is not running.

   解决方法   : Check if the HistoryServer process is running.

□ 警报名称: History Server RPC latency

   警报类型   : METRIC

   描述       : This host-level alert is triggered if the HistoryServer operations RPC latency exceeds the configured critical threshold.

               Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to

               increase for HistoryServer operations.

   潜在原因   : A job or an application is performing too many HistoryServer operations.

   解决方法   : Review the job or the application for potential bugs causing it to perform too many HistoryServer operations.

□ 警报名称: History Server CPU Utilization

   警报类型   : METRIC

   描述       : This host-level alert is triggered if the percent of CPU utilization on the HistoryServer exceeds the configured

               critical threshold.

   潜在原因   : Unusually high CPU utilization: Can be caused by a very unusual job/query workload, but this is generally the sign of

               an issue in the daemon.

   解决方法   : Use the top command to determine which processes are consuming excess CPU.

               Reset the offending process.

□ 警报名称: History Server Process

   警报类型   : PORT

   描述       : This host-level alert is triggered if the HistoryServer process cannot be established to be up and listening on the

               network for the configured critical threshold, given in seconds.

   潜在原因   : HistoryServer process is down or not responding.

               HistoryServer is not down but is not listening to the correct network port/address.

   解决方法   : Check the HistoryServer is running.

               Check for any errors in the HistoryServer logs (/var/log/hadoop/mapred) and restart the HistoryServer, if necessary.

8.5.6 HBase 服务警报 (HBase Service Alerts)

□ 警报名称: Percent RegionServers Available

   警报类型   :

   描述       : This service-level alert is triggered if the configured percentage of Region Server processes cannot be determined to be

               up and listening on the network for the configured critical threshold. The default setting is 10% to produce a WARN alert

               and 30% to produce a CRITICAL alert. It aggregates the results of RegionServer process down checks.

   潜在原因   : Misconfiguration or less-than-ideal configuration caused the RegionServers to crash.

               Cascading failures brought on by some workload caused the RegionServers to crash.

               The RegionServers shut themselves down because there were problems in the dependent services, ZooKeeper or HDFS.

               GC paused the RegionServer for too long and the RegionServers lost contact with ZooKeeper.

   解决方法   : Check the dependent services to make sure they are operating correctly.

               Look at the RegionServer log files (usually /var/log/hbase/*.log) for further information.

               If the failure was associated with a particular workload, try to understand the workload better.

               Restart the RegionServers.

□ 警报名称: HBase Master Process

   警报类型   :

   描述       : This alert is triggered if the HBase master processes cannot be confirmed to be up and listening on the network for

               the configured critical threshold, given in seconds.

   潜在原因   : The HBase master process is down.

               The HBase master has shut itself down because there were problems in the dependent services, ZooKeeper or HDFS.

   解决方法   : Check the dependent services.

               Look at the master log files (usually /var/log/hbase/*.log) for further information.

               Look at the configuration files (/etc/hbase/conf).

               Restart the master.

□ 警报名称: HBase Master CPU Utilization

   描述       : This host-level alert is triggered if CPU utilization of the HBase Master exceeds certain thresholds (200% warning,

               250% critical). It checks the HBase Master JMX Servlet for the SystemCPULoad property. This information is only available

               if you are running JDK 1.7.

   潜在原因   : Unusually high CPU utilization: Can be caused by a very unusual job/query workload, but this is generally the sign of

               an issue in the daemon.


   解决方法   : Use the top command to determine which processes are consuming excess CPU.

               Reset the offending process.

□ 警报名称: RegionServers Health Summary

   描述       : This service-level alert is triggered if there are unhealthy RegionServers.

   潜在原因   : The RegionServer process is down on the host.

               The RegionServer process is up and running but not listening on the correct network port (default 60030).

   解决方法   : Check for dead RegionServer in Ambari Web.

□ 警报名称: HBase RegionServer Process

   描述       : This host-level alert is triggered if the RegionServer processes cannot be confirmed to be up and listening on the

               network for the configured critical threshold, given in seconds.

   潜在原因   : The RegionServer process is down on the host.

               The RegionServer process is up and running but not listening on the correct network port (default 60030).

   解决方法   : Check for any errors in the logs (/var/log/hbase/) and restart the RegionServer process using Ambari Web.

               Run the netstat -tuplpn command to check if the RegionServer process is bound to the correct network port.

8.5.7 Hive 警报 (Hive Alerts)

□ 警报名称: HiveServer2 Process

   警报类型   :

   描述       : This host-level alert is triggered if the HiveServer2 process cannot be determined to be up and responding to client requests.

   潜在原因   : HiveServer2 process is not running.

               HiveServer2 process is not responding.

   解决方法   : Using Ambari Web, check status of HiveServer2 component. Stop and then restart.

□ 警报名称: HiveMetastore Process

   描述       : This host-level alert is triggered if the Hive Metastore process cannot be determined to be up and listening on the

               network for the configured critical threshold, given in seconds.

   潜在原因   : The Hive Metastore service is down.

               The database used by the Hive Metastore is down.

               The Hive Metastore host is not reachable over the network.

   解决方法   : Using Ambari Web, stop the Hive service and then restart it.

□ 警报名称: WebHCat Server Status

   警报类型   :

   描述       : This host-level alert is triggered if the WebHCat server cannot be determined to be up and responding to client requests.

   潜在原因   : The WebHCat server is down.

               The WebHCat server is hung and not responding.

               The WebHCat server is not reachable over the network.

   解决方法   : Restart the WebHCat server using Ambari Web.

8.5.8 Oozie 警报 (Oozie Alerts)

□ 警报名称: Oozie Server Web UI

   描述       : This host-level alert is triggered if the Oozie server Web UI is unreachable.

   潜在原因   : The Oozie server is down.

               Oozie Server is not down but is not listening to the correct network port/address.

   解决方法   : Check for dead Oozie Server in Ambari Web.

□ 警报名称: Oozie Server Status

   描述       : This host-level alert is triggered if the Oozie server cannot be determined to be up and responding to client requests.

   潜在原因   : The Oozie server is down.

               The Oozie server is hung and not responding.

               The Oozie server is not reachable over the network.

   解决方法   : Restart the Oozie service using Ambari Web.

8.5.9 ZooKeeper 警报 (ZooKeeper Alerts)

□ 警报名称: Percent ZooKeeper Servers Available

   警报类型   : AGGREGATE

   描述       : This service-level alert is triggered if the configured percentage of ZooKeeper processes cannot be determined to be up

               and listening on the network for the configured critical threshold, given in seconds. It aggregates the results of

               ZooKeeper process checks.

   潜在原因   : The majority of your ZooKeeper servers are down and not responding.

   解决方法   : Check the dependent services to make sure they are operating correctly.

               Check the ZooKeeper logs (/var/log/hadoop/zookeeper.log) for further information.

               If the failure was associated with a particular workload, try to understand the workload better.

               Restart the ZooKeeper servers from the Ambari UI.

□ 警报名称: ZooKeeper Server Process

   警报类型   : PORT

   描述       : This host-level alert is triggered if the ZooKeeper server process cannot be determined to be up and listening on the

               network for the configured critical threshold, given in seconds.

   潜在原因   : The ZooKeeper server process is down on the host.

               The ZooKeeper server process is up and running but not listening on the correct network port (default 2181).

   解决方法   : Check for any errors in the ZooKeeper logs (/var/log/zookeeper/) and restart the ZooKeeper process using Ambari Web.

               Run the netstat -tuplpn command to check if the ZooKeeper server process is bound to the correct network port.

8.5.10 Ambari 警报 (Ambari Alerts)

□ 警报名称: Host Disk Usage

   警报类型   : SCRIPT

   描述       : This host-level alert is triggered if the amount of disk space used on a host goes above specific thresholds (50% warn,

               80% critical).

   潜在原因   : The amount of free disk space left is low.

   解决方法   : Check host for disk space to free or add more storage.

□ 警报名称: Ambari Agent Heartbeat

   警报类型   : SERVER

   描述       : This alert is triggered if the server has lost contact with an agent.

   潜在原因   : Ambari Server host is unreachable from Agent host

               Ambari Agent is not running

   解决方法   : Check connection from Agent host to Ambari Server

               Check Agent is running

□ 警报名称: Ambari Server Alerts

   警报类型   : SERVER

   描述       : This alert is triggered if the server detects that there are alerts which have not run in a timely manner

   潜在原因   : Agents are not reporting alert status

               Agents are not running

   解决方法   : Check that all Agents are running and heartbeating

8.5.11 Ambari Metrics 警报 (Ambari Metrics Alerts)

□ 警报名称: Metrics Collector Process

   描述       : This alert is triggered if the Metrics Collector cannot be confirmed to be up and listening on the configured port for a

               number of seconds equal to the threshold.

   潜在原因   : The Metrics Collector process is not running.

   解决方法   : Check the Metrics Collector is running.

□ 警报名称: Metrics Collector – ZooKeeper Server Process

   警报类型   :

   描述       : This host-level alert is triggered if the Metrics Collector ZooKeeper Server Process cannot be determined to be up and

               listening on the network.

   潜在原因   : The Metrics Collector process is not running.

   解决方法   : Check the Metrics Collector is running.

□ 警报名称: Metrics Collector – HBase Master Process

   警报类型   :

   描述       : This alert is triggered if the Metrics Collector HBase Master Processes cannot be confirmed to be up and listening on

               the network for the configured critical threshold, given in seconds.

   潜在原因   : The Metrics Collector process is not running.

   解决方法   : Check the Metrics Collector is running.

□ 警报名称: Metrics Collector – HBase Master CPU Utilization

   警报类型   :

   描述       : This host-level alert is triggered if CPU utilization of the Metrics Collector exceeds certain thresholds.

   潜在原因   : Unusually high CPU utilization is generally the sign of an issue in the daemon configuration.

   解决方法   : Tune the Ambari Metrics Collector.

□ 警报名称: Metrics Monitor Status

   警报类型   :

   描述       : This host-level alert is triggered if the Metrics Monitor process cannot be confirmed to be up and running on the network.

   潜在原因   : The Metrics Monitor is down.

   解决方法   : Check whether the Metrics Monitor is running on the given host.

□ 警报名称: Percent Metrics Monitors Available

   描述       : This is an AGGREGATE alert of the Metrics Monitor Status.

   潜在原因   : Metrics Monitors are down.

   解决方法   : Check the Metrics Monitors are running.

□ 警报名称: Metrics Collector – Auto-Restart Status

   描述       : This alert is triggered if the Metrics Collector has been auto-started a number of times equal to the start threshold in

               a 1 hour timeframe. By default if restarted 2 times in an hour, you will receive a Warning alert. If restarted 4 or more

               times in an hour, you will receive a Critical alert.

   潜在原因   : The Metrics Collector is running but is unstable and causing restarts. This could be due to improper tuning.

   解决方法   : Tune the Ambari Metrics Collector.
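
The restart-count rule described above (2 restarts in an hour for a Warning, 4 or more for a Critical) can be sketched as a sliding-window count. This is an illustration only, not Ambari's implementation. The function name and the way restart timestamps are supplied are hypothetical.

```python
from datetime import datetime, timedelta


def auto_restart_state(restart_times, now, warn_count=2, crit_count=4,
                       window=timedelta(hours=1)):
    """Classify by the number of restarts that fall inside the last `window`."""
    recent = [t for t in restart_times if now - window <= t <= now]
    count = len(recent)
    if count >= crit_count:
        return "CRITICAL", count
    if count >= warn_count:
        return "WARNING", count
    return "OK", count


now = datetime(2024, 4, 5, 17, 0, 0)
# Three hypothetical restarts; only the last two fall inside the past hour.
restarts = [now - timedelta(minutes=m) for m in (90, 40, 5)]
state, count = auto_restart_state(restarts, now)
```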

□ 警报名称: Grafana Web UI

   描述       : This host-level alert is triggered if the AMS Grafana Web UI is unreachable.

   潜在原因   : Grafana process is not running.

   解决方法   : Check whether the Grafana process is running. Restart if it has gone down.

8.5.12 SmartSense 警报 (SmartSense Alerts)

□ 警报名称: SmartSense Server Process

   描述       : This alert is triggered if the HST server process cannot be confirmed to be up and listening on the network for the

               configured critical threshold, given in seconds.

   潜在原因   : HST server is not running.

   解决方法   : Start HST server process. If startup fails, check the hst-server.log.

□ 警报名称: SmartSense Bundle Capture Failure

   描述       : This alert is triggered if the last triggered SmartSense bundle failed or timed out.

   潜在原因   : Some nodes are timed out during capture or fail during data capture. It could also be because upload to Hortonworks fails.

   解决方法   : From the "Bundles" page check the status of bundle. Next, check which agents have failed or timed out, and review their logs.

               You can also initiate a new capture.

□ 警报名称: SmartSense Long Running Bundle

   描述       : This alert is triggered if the in-progress SmartSense bundle may not complete successfully on time.

   潜在原因   : Service components that are getting collected may not be running. Or some agents may be timing out during data

               collection/upload.

   解决方法   : Restart the services that are not running. Force-complete the bundle and start a new capture.

□ 警报名称: SmartSense Gateway Status

   描述       : This alert is triggered if the SmartSense Gateway server process is enabled but is unable to reach.

   潜在原因   : SmartSense Gateway is not running.

   解决方法   : Start the gateway. If gateway start fails, review hst-gateway.log

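8.6 Managing Notifications

Alert groups and notifications let you group alerts and assign notification targets to each group, so that different sets of alerts reach different cluster stakeholders through different channels. For example, you might want the Hadoop Operations team to receive every alert by email regardless of status, while the system administration team receives only RPC- and CPU-related alerts in the Critical state, and only through Simple Network Management Protocol (SNMP).

To achieve this, create one alert notification that manages email delivery for all alert groups at all severity levels, and a second alert notification that manages SNMP delivery of Critical-severity alerts for an alert group containing only the RPC and CPU alerts.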
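The routing rule described above can be sketched in a few lines of Python. This is only an illustrative model, not Ambari code: the `Notification` class, the `targets_for` function, and the group name `rpc-cpu` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Notification:
    """Hypothetical model of one alert notification (target)."""
    name: str
    method: str                                  # "EMAIL" or "SNMP"
    severities: set                              # alert states this target accepts
    groups: set = field(default_factory=set)     # empty set means "all alert groups"


def targets_for(alert_group: str, state: str, notifications: list) -> list:
    """Return the names of the notifications that should receive this alert."""
    return [
        n.name
        for n in notifications
        if state in n.severities and (not n.groups or alert_group in n.groups)
    ]


# Operations team: every alert, every state, by email.
ops_email = Notification("ops-email", "EMAIL", {"OK", "WARNING", "CRITICAL", "UNKNOWN"})
# System administrators: only CRITICAL alerts from the RPC/CPU group, by SNMP.
sysadmin_snmp = Notification("sysadmin-snmp", "SNMP", {"CRITICAL"}, {"rpc-cpu"})
notifications = [ops_email, sysadmin_snmp]

print(targets_for("rpc-cpu", "CRITICAL", notifications))
print(targets_for("rpc-cpu", "WARNING", notifications))
print(targets_for("hdfs", "CRITICAL", notifications))
```

Only the first call reaches both targets; a WARNING on the same group, or a CRITICAL alert from another group, goes to the email target alone.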

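8.7 Creating and Editing Notifications

   ① In Ambari Web, click Alerts

   ② On the Alerts page, click the Actions menu, then click Manage Notifications

   ③ In Manage Alert Notifications, click + to create a new alert notification

       In Create Alert Notification:

       ● In the Name field, enter a name for the notification

       ● In the Groups field, click All or Custom to assign the notification to all alert groups or to a selected set of groups

       ● In the Description field, enter a short phrase describing the notification

       ● In the Method field, click EMAIL, SNMP (for MIB-based), or Custom SNMP as the method by which the Ambari server sends the notification

   ④ Complete the fields for the selected notification method

       ● For email notifications, provide the SMTP information, such as the SMTP server, port, and From address, and whether the server requires authentication

       You can add custom properties to the SMTP configuration, based on the JavaMail SMTP options

           Email To           : A comma-separated list of one or more email addresses to which alert emails are sent

           SMTP Server        : The FQDN or IP address of the SMTP server used to send alert emails

           SMTP Port          : The SMTP port of the SMTP server

           Email From         : A single email address used as the sender of alert emails

           Use Authentication : Whether the SMTP server requires authentication before sending messages. If it does, also provide the username and password credentials

       ● For MIB-based SNMP notifications, provide the version, community, host, and port to which the SNMP trap is sent

           Version            : SNMPv1 or SNMPv2c, depending on the network environment

           Hosts              : A comma-separated list of one or more host FQDNs to which to send the trap

           Port               : The port on which a process is listening for SNMP traps

           For SNMP notifications, Ambari uses a "MIB", a text file manifest of alert definitions, to transfer alert information. The MIB summarizes how object IDs map to objects or attributes.

           You can find the MIB file for your cluster on the Ambari server host:

               /var/lib/ambari-server/resources/APACHE-AMBARI-MIB.txt

       ● For custom SNMP notifications, provide the version, community, host, and port to which the SNMP trap is sent.

          The OID parameter must be configured correctly. If you have no custom, enterprise-specific OID, use the following values:

           Version            : SNMPv1 or SNMPv2c, depending on the network environment

           OID                : 1.3.6.1.4.1.18060.16.1.1

           Hosts              : A comma-separated list of one or more host FQDNs to which to send the trap

           Port               : The port on which a process is listening for SNMP traps

   ⑤ Click Save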

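8.8 Creating or Editing Alert Groups

   ① In Ambari Web, click Alerts

   ② On the Alerts page, click the Actions menu, then click Manage Alert Groups

   ③ In Manage Alert Groups, click + to create a new alert group

   ④ In Create Alert Group, enter a group name and click Save

   ⑤ Click the custom group in the list to add or remove alert definitions and to change the notification targets for that group

   ⑥ When the assignments are complete, click Save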

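8.9 Dispatching Notifications

When an alert is enabled and its status changes (for example, from OK to CRITICAL, or from CRITICAL to OK), Ambari sends either an email or an SNMP notification, depending on how notifications are configured.

For email notifications, Ambari sends a single email that contains all of the alert status changes. For example, if two alerts become critical, Ambari sends one email message:

   Alert A is CRITICAL and Alert B is CRITICAL

Ambari does not send another email notification until the status changes again.

For SNMP notifications, Ambari sends one SNMP trap per alert status change. For example, if two alerts become critical, Ambari sends two SNMP traps, one per alert, and sends traps again when the status of either alert changes.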
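The difference between the two dispatch styles can be illustrated with a small sketch. It only models the batching behavior described above; `build_messages` is a hypothetical helper, not part of Ambari, and the message wording is simplified.

```python
def build_messages(method: str, changes: list) -> list:
    """Model how many messages go out for a batch of alert state changes.

    changes is a list of (alert_name, new_state) tuples.
    EMAIL: one message summarizing every change in the batch.
    SNMP:  one message (trap) per changed alert.
    """
    if not changes:
        return []
    if method == "EMAIL":
        return [" and ".join(f"{name} is {state}" for name, state in changes)]
    if method == "SNMP":
        return [f"{name} is {state}" for name, state in changes]
    raise ValueError(f"unsupported method: {method}")


changes = [("Alert A", "CRITICAL"), ("Alert B", "CRITICAL")]
print(build_messages("EMAIL", changes))  # a single combined message
print(build_messages("SNMP", changes))   # two messages, one per alert
```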

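8.10 Viewing the Alert Status Log

Whether or not Ambari is configured to send alert notifications, it writes alert status changes to a log on the Ambari server host. To view the log:

   ① On the Ambari server host, browse to the log directory

       cd /var/log/ambari-server/

   ② View the ambari-alerts.log file

   ③ Each log entry includes the time of the status change, the alert status, the alert definition name, and the response text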

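8.10.1 Customizing Notification Templates

The content of a notification generated by Ambari depends on the notification type. Email and SNMP notifications each have a customizable template from which their content is generated. This section describes the steps needed to change the templates that Ambari uses to create alert notifications.

   Alert template XML location

   By default, Ambari ships with an alert-templates.xml file. This file contains all of the templates for every known notification type (for example, EMAIL and SNMP). The file is bundled into the Ambari server .jar file, so the templates are not exposed on disk. Its content is nevertheless used in the text below as a reference example.

   To customize the alert templates, you effectively override the default alert template XML, as follows:

   ① On the Ambari server host, browse to the /etc/ambari-server/conf directory

   ② Edit the ambari.properties file

   ③ Add an entry for the location of the new template file

       alerts.template.file=/foo/var/alert-templates-custom.xml

   ④ Save the file and restart Ambari Server

   After Ambari Server restarts, any notification type defined in the new template file overrides the template definition bundled with Ambari. If you provide your own template file, you only need to define the types you want to override. If a notification type is not found in the custom template file, Ambari uses the default template bundled in the JAR file.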

   Alert template XML structure

   The template file is structured as follows. Each <alert-template> element declares the notification type to which it applies:

   <alert-templates>

       <alert-template type="EMAIL">

           <subject>

               Subject Content

           </subject>

           <body>

               Body Content

           </body>

       </alert-template>

       <alert-template type="SNMP">

           <subject>

               Subject Content

           </subject>

           <body>

               Body Content

           </body>

       </alert-template>

   </alert-templates>
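As a rough illustration of this structure and of the per-type override rule, the following Python sketch parses two template documents and merges them. It is not Ambari's own loader: `load_templates`, `effective_templates`, and both sample XML strings are made up for the example.

```python
import xml.etree.ElementTree as ET


def load_templates(xml_text: str) -> dict:
    """Parse an <alert-templates> document into {type: {"subject", "body"}}."""
    root = ET.fromstring(xml_text)
    templates = {}
    for node in root.findall("alert-template"):
        templates[node.get("type")] = {
            "subject": (node.findtext("subject") or "").strip(),
            "body": (node.findtext("body") or "").strip(),
        }
    return templates


def effective_templates(defaults: dict, custom: dict) -> dict:
    """Types defined in the custom file win; all other types fall back to defaults."""
    merged = dict(defaults)
    merged.update(custom)
    return merged


# Stand-in for the bundled defaults (illustrative content only).
DEFAULT_XML = """<alert-templates>
  <alert-template type="EMAIL"><subject>Default subject</subject><body>Default body</body></alert-template>
  <alert-template type="SNMP"><subject>Default SNMP subject</subject><body>Default SNMP body</body></alert-template>
</alert-templates>"""

# Stand-in for a custom file that overrides only the EMAIL type.
CUSTOM_XML = """<alert-templates>
  <alert-template type="EMAIL">
    <subject><![CDATA[Custom subject]]></subject>
    <body>Custom body</body>
  </alert-template>
</alert-templates>"""

merged = effective_templates(load_templates(DEFAULT_XML), load_templates(CUSTOM_XML))
print(merged["EMAIL"]["subject"])  # taken from the custom file
print(merged["SNMP"]["subject"])   # falls back to the default
```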

   Template variables

   The templates use Apache Velocity to render all tokenized content. The following variables are available for use in a template:

   $alert.getAlertDefinition() The definition of which the alert is an instance.

   $alert.getAlertText() The specific alert text.

   $alert.getAlertName() The name of the alert.

   $alert.getAlertState() The alert state (OK, WARNING, CRITICAL, or UNKNOWN)

   $alert.getServiceName() The name of the service that the alert is defined for.

   $alert.hasComponentName() True if the alert is for a specific service component.

   $alert.getComponentName() The component, if any, that the alert is defined for.

   $alert.hasHostName() True if the alert was triggered for a specific host.

   $alert.getHostName() The hostname, if any, that the alert was triggered for.

   $ambari.getServerUrl() The Ambari Server URL.

   $ambari.getServerVersion() The Ambari Server version.

   $ambari.getServerHostName() The Ambari Server hostname.

   $dispatch.getTargetName() The notification target name.

   $dispatch.getTargetDescription() The notification target description.

   $summary.getAlerts(service,alertState) A list of all alerts for a given service or alert state (OK|WARNING|CRITICAL|UNKNOWN)

   $summary.getServicesByAlertState(alertState) A list of all services for a given alert state (OK|WARNING|CRITICAL|UNKNOWN)

   $summary.getServices() A list of all services that are reporting an alert in the notification.

   $summary.getCriticalCount() The CRITICAL alert count.

   $summary.getOkCount() The OK alert count.

   $summary.getTotalCount() The total alert count.

   $summary.getUnknownCount() The UNKNOWN alert count.

   $summary.getWarningCount() The WARNING alert count.

   $summary.getAlerts() A list of all of the alerts in the notification.

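   Example: Modify Alert EMAIL Subject

   The following example shows how to change the subject line of all outbound email notifications so that it includes a hard-coded identifier:

   ① Download the alert-templates.xml code as a starting point

   ② On the Ambari Server, save the template to a location such as /var/lib/ambari-server/resources/alert-templates-custom.xml

   ③ Edit the alert-templates-custom.xml file and modify the subject line of the <alert-template type="EMAIL"> template

       <subject>

       <![CDATA[Petstore Ambari has $summary.getTotalCount() alerts!]]>

       </subject>

   ④ Save the file

   ⑤ Browse to the /etc/ambari-server/conf directory

   ⑥ Edit the ambari.properties file

   ⑦ Add an entry for the location of the new template file

       alerts.template.file=/var/lib/ambari-server/resources/alert-templates-custom.xml

   ⑧ Save the file and restart Ambari Server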
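The edit to ambari.properties can be scripted so that running it twice does not add a duplicate entry. The sketch below operates on the file content as a string; `set_property` is a hypothetical helper, and reading or writing the real file (and restarting Ambari Server afterwards) is left to the operator.

```python
def set_property(properties_text: str, key: str, value: str) -> str:
    """Return properties_text with key=value set, replacing an existing entry if present."""
    out = []
    replaced = False
    for line in properties_text.splitlines():
        stripped = line.strip()
        is_comment = stripped.startswith("#")
        if not is_comment and stripped.split("=", 1)[0].strip() == key:
            out.append(f"{key}={value}")   # overwrite the existing entry
            replaced = True
        else:
            out.append(line)               # keep every other line untouched
    if not replaced:
        out.append(f"{key}={value}")       # append when the key was absent
    return "\n".join(out) + "\n"


original = "server.os_type=centos7\n"  # illustrative existing content
updated = set_property(
    original,
    "alerts.template.file",
    "/var/lib/ambari-server/resources/alert-templates-custom.xml",
)
print(updated)
```

Applying `set_property` again to `updated` with the same arguments returns the same text, which makes the step safe to repeat.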

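9. Using Ambari Core Services

The Ambari core services let you monitor, analyze, and search the operating status of the hosts in the cluster.

9.1 Understanding Ambari Metrics

Ambari Metrics System (AMS) collects, aggregates, and serves Hadoop and system metrics in Ambari-managed clusters.

9.1.1 AMS Architecture

AMS has four components: Metrics Monitors, Hadoop Sinks, Metrics Collector, and Grafana.

   • Metrics Monitors   : Collect system-level metrics on each host in the cluster and publish them to the Metrics Collector

   • Hadoop Sinks       : Plug in to Hadoop components to publish Hadoop metrics to the Metrics Collector

   • Metrics Collector  : A daemon that runs on a specific host in the cluster and receives data from the registered publishers, that is, the Monitors and Sinks

   • Grafana            : A daemon that runs on a specific host in the cluster and provides prebuilt dashboards for visualizing the metrics collected in the Metrics Collector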

9.1.2 Using Grafana

Ambari Metrics System includes Grafana, which provides prebuilt dashboards for advanced visualization of cluster metrics.

9.1.2.1 Accessing Grafana

   ① In Ambari Web, browse to Services > Ambari Metrics > Summary

   ② Select Quick Links, then choose Grafana

   A read-only version of the Grafana page opens in a new browser tab

9.1.2.2 Viewing Grafana Dashboards

On the Grafana home page, Dashboards provides a list of links to the AMS, Ambari server, Druid, and HBase metrics.

To view a specific metric included in the list:

   ① In Grafana, browse to Dashboards

   ② Click a dashboard name

   ③ To see more dashboards, click the Home list

   ④ Scroll through the list

   For example, System - Servers

9.1.2.3 Viewing Selected Metrics on Grafana Dashboards

On a dashboard, expand one or more rows to view detailed metrics. For example:

On the System - Servers dashboard, click a row name such as System Load Average - 1 Minute

The row expands to display a chart of the metric.

9.1.2.4 Viewing Metrics for Selected Hosts

By default, Grafana displays metrics for all hosts in the cluster. You can limit the display to one or several hosts by selecting them from the Hosts menu

   ① Expand Hosts

   ② Select one or more host names

9.1.3 Grafana Dashboards Reference

The Grafana instance included with Ambari Metrics System comes with prebuilt dashboards for advanced visualization of cluster metrics.

   • AMS HBase Dashboards

   • Ambari Dashboards

   • Druid Dashboards

   • HDFS Dashboards

   • YARN Dashboards

   • Hive Dashboards

   • Hive LLAP Dashboards

   • HBase Dashboards

   • Kafka Dashboards

   • Storm Dashboards

   • System Dashboards

   • NiFi Dashboards

9.1.3.1 AMS HBase Dashboards

AMS HBase refers to the HBase instance that Ambari Metrics Service manages independently. It has no connection to the cluster's HBase service. The AMS HBase dashboards track the same metrics as the regular HBase dashboards, but for the instance owned by AMS.

The following Grafana dashboards are available for AMS HBase:

   • AMS HBase - Home

   • AMS HBase - RegionServers

   • AMS HBase - Misc

9.1.3.1.1 AMS HBase - Home

The AMS HBase - Home dashboard displays basic statistics about the HBase cluster. These panels give an overview of the overall state of the HBase cluster.

   REGIONSERVERS / REGIONS

   -------------------------------------------------------------------------------------------------------------------------------------

   Num RegionServers                   : Total number of RegionServers in the cluster.

   Num Dead RegionServers               : Total number of RegionServers that are dead in the cluster.

   Num Regions                           : Total number of regions in the cluster.

   Avg Num Regions per RegionServer   : Average number of regions per RegionServer.

   NUM REGIONS/STORES

   Num Regions /Stores - Total           : Total number of regions and stores (column families) in the cluster.

   Store File Size /Count - Total       : Total data file size and number of store files.

   NUM REQUESTS

   Num Requests - Total               : Total number of requests (read, write and RPCs) in the cluster.

   Num Request - Breakdown - Total       : Total number of get,put,mutate,etc requests in the cluster.

   REGIONSERVER MEMORY

   RegionServer Memory - Average       : Average used, max or committed on-heap and offheap memory for RegionServers.

   RegionServer Offheap Memory - Average   : Average used, free or committed on-heap and offheap memory for RegionServers.

   MEMORY - MEMSTORE BLOCKCACHE

   Memstore - BlockCache - Average       : Average blockcache and memstore sizes for RegionServers.

   Num Blocks in BlockCache - Total   : Total number of (hfile) blocks in the blockcaches across all RegionServers.

   BLOCKCACHE

   BlockCache Hit/Miss/s Total           : Total number of blockcache hits misses and evictions across all RegionServers.

   BlockCache Hit Percent - Average   : Average blockcache hit percentage across all RegionServers.

   OPERATION LATENCIES - GET/MUTATE

   Get Latencies - Average               : Average min, median, max, 75th, 95th, 99th percentile latencies for Get operation across

                                       all RegionServers.

   Mutate Latencies - Average           : Average min, median, max, 75th, 95th, 99th percentile latencies for Mutate operation across

                                       all RegionServers.

   OPERATION LATENCIES - DELETE/INCREMENT

   Delete Latencies - Average           : Average min, median, max, 75th, 95th, 99th percentile latencies for Delete operation across

                                       all RegionServers.

   Increment Latencies - Average       : Average min, median, max, 75th, 95th, 99th percentile latencies for Increment operation across

                                       all RegionServers.

   OPERATION LATENCIES - APPEND/REPLAY

   Append Latencies - Average           : Average min, median, max, 75th, 95th, 99th percentile latencies for Append operation across

                                       all RegionServers.

   Replay Latencies - Average           : Average min, median, max, 75th, 95th, 99th percentile latencies for Replay operation across

                                       all RegionServers.

   REGIONSERVER RPC

   RegionServer RPC - Average           : Average number of RPCs, active handler threads and open connections across all RegionServers.

   RegionServer RPC Queues - Average   : Average number of calls in different RPC scheduling queues and the size of all requests in the

                                       RPC queue across all RegionServers.

   REGIONSERVER RPC

   RegionServer RPC Throughput - Average   : Average sent and received bytes from the RPC across all RegionServers.

9.1.3.1.2 AMS HBase - RegionServers

The AMS HBase - RegionServers dashboard displays metrics for the RegionServers in the monitored HBase cluster, including some performance-related data. These panels help you view basic I/O data and compare load among RegionServers.

9.1.3.1.3 AMS HBase - Misc

The AMS HBase - Misc dashboard displays miscellaneous metrics related to the HBase cluster. You can use these metrics for tasks such as debugging authentication and authorization issues and exceptions raised by RegionServers.

9.1.3.2 Ambari Dashboards

The following dashboards are available for Ambari:

   • Ambari Server Database

   • Ambari Server JVM

   • Ambari Server Top N

9.1.3.2.1 Ambari Server Database

Displays the operating status of the Ambari server database.

   TOTAL READ ALL QUERY

   Total Read All Query Counter (Rate)       : Total ReadAllQuery operations performed.

   Total Read All Query Timer (Rate)       : Total time spent on ReadAllQuery.

   TOTAL CACHE HITS & MISSES

   Total Cache Hits (Rate)                   : Total cache hits on Ambari Server with respect to EclipseLink cache.

   Total Cache Misses (Rate)               : Total cache misses on Ambari Server with respect to EclipseLink cache.

   QUERY

   Query Stages Timings                   : Average time spent on every query sub stage by Ambari Server

   Query Types Avg. Timings               : Average time spent on every query type by Ambari Server.

   HOST ROLE COMMAND ENTITY

   Counter.ReadAllQuery.HostRoleCommandEntity (Rate)   : Rate (num operations per second) in which ReadAllQuery operation on

                                                       HostRoleCommandEntity is performed.

   Timer.ReadAllQuery.HostRoleCommandEntity (Rate)       : Rate in which ReadAllQuery operation on HostRoleCommandEntity is performed.

   ReadAllQuery.HostRoleCommandEntity                   : Average time taken for a ReadAllQuery operation on HostRoleCommandEntity (

                                                       Timer / Counter).

9.1.3.2.2 Ambari Server JVM

   JVM - MEMORY PRESSURE

   Heap Usage           : Used, max or committed on-heap memory for Ambari Server.

   Off-Heap Usage       : Used, max or committed off-heap memory for Ambari Server.

   JVM GC COUNT

   GC Count Par new /s   : Number of Java ParNew (YoungGen) Garbage Collections per second.

   GC Time Par new /s   : Total time spent in Java ParNew (YoungGen) Garbage Collections per second.

   GC Count CMS /s       : Number of Java CMS Garbage Collections per second.

   GC Time Par CMS /s   : Total time spent in Java CMS Garbage Collections per second.

   JVM THREAD COUNT

   Thread Count       : Number of active, daemon, deadlock, blocked and runnable threads.

9.1.3.2.3 Ambari Server Top N

   READ ALL QUERY

   Top ReadAllQuery Counters   : Top N Ambari Server entities by number of ReadAllQuery operations performed.

   Top ReadAllQuery Timers       : Top N Ambari Server entities by time spent on ReadAllQuery operations.

   Cache Misses

   Cache Misses               : Top N Ambari Server entities by number of Cache Misses.

9.1.3.3 Druid Dashboards

9.1.3.4 HDFS Dashboards

The following Grafana dashboards are available for the Hadoop Distributed File System (HDFS) component:

   • HDFS - Home

   • HDFS - NameNodes

   • HDFS - DataNodes

   • HDFS - Top-N

   • HDFS - Users

9.1.3.5 YARN Dashboards

The following Grafana dashboards are available for YARN:

   • YARN - Home

   • YARN - Applications

   • YARN - MR JobHistory Server

   • YARN - NodeManagers

   • YARN - Queues

   • YARN - ResourceManager

9.1.3.6 Hive Dashboards

The following Grafana dashboards are available for Hive:

   • Hive - Home

   • Hive - HiveMetaStore

   • Hive - HiveServer2

9.1.3.7 Hive LLAP Dashboards

The following Grafana dashboards are available for Hive LLAP:

   • Hive LLAP - Heatmap

   • Hive LLAP - Overview

   • Hive LLAP - Daemon

9.1.3.8 HBase Dashboards

The following Grafana dashboards are available for HBase:

   • HBase - Home

   • HBase - RegionServers

   • HBase - Misc

   • HBase - Tables

   • HBase - Users

9.1.3.9 Kafka Dashboards

The following Grafana dashboards are available for Kafka:

   • Kafka - Home

   • Kafka - Hosts

   • Kafka - Topics

9.1.3.10 Storm Dashboards

The following Grafana dashboards are available for Storm:

   • Storm - Home

   • Storm - Topology

   • Storm - Components

9.1.3.11 System Dashboards

The following Grafana dashboards are available for System:

   • System - Home

   • System - Servers

9.1.3.12 NiFi Dashboards

The following Grafana dashboard is available for NiFi:

   • NiFi-Home

9.1.4 AMS Performance Tuning

To set up Ambari Metrics System in your environment, review and customize the following Metrics Collector configuration options:

   • Customizing the Metrics Collector Mode

   • Customizing TTL Settings

   • Customizing Memory Settings

   • Customizing Cluster-Environment-Specific Settings

   • Moving the Metrics Collector

   • (Optional) Enabling Individual Region, Table, and User Metrics for HBase

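9.1.4.1 Customizing the Metrics Collector Mode

Metrics Collector is built using Hadoop technologies such as Apache HBase, Apache Phoenix, and the YARN Application Timeline Server (ATS). The Collector can store metrics data on the local file system, referred to as embedded mode, or use an external HDFS, referred to as distributed mode. By default, the Collector runs in embedded mode. In embedded mode, the Collector captures the data and writes the metrics to the local file system of the host on which the Collector runs.

   Important:

       When running in embedded mode, make sure that hbase.rootdir and hbase.tmp.dir point to directories that are large enough to hold the data and that are lightly loaded. The directories are configured in

           Ambari Metrics > Configs > Advanced > ams-hbase-site

       They should reside on a partition that has enough space and is not heavily used, for example:

       file:///grid/0/var/lib/ambari-metrics-collector/hbase

       Also make sure that the TTL settings are appropriate.

When the Collector is configured for distributed mode, it writes metrics to HDFS, and its components run in distributed processes, which helps to manage CPU and memory consumption.

To switch the Metrics Collector from embedded mode to distributed mode:

   ① In Ambari Web, browse to Services > Ambari Metrics > Configs

   ② Change the values of the properties listed in the following table: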





+-------------------------+-------------------------------------------+-------------------------------+-----------------------------+

| Configuration Section   | Property                                  | Description                   | Value                       |

+-------------------------+-------------------------------------------+-------------------------------+-----------------------------+

| General                 | Metrics Service operation mode            | Designates whether to run in  | distributed                 |

|                         | (timeline.metrics.service.operation.mode) | distributed or embedded mode. |                             |

+-------------------------+-------------------------------------------+-------------------------------+-----------------------------+

| Advanced ams-hbase-site | hbase.cluster.distributed                 | Indicates AMS will run in     | true                        |

|                         |                                           | distributed mode.             |                             |

+-------------------------+-------------------------------------------+-------------------------------+-----------------------------+

| Advanced ams-hbase-site | hbase.rootdir                             | The HDFS directory location   | hdfs://$NAMENODE_FQDN:8020/ |

|                         |                                           | where metrics will be stored. | apps/ams/metrics            |

+-------------------------+-------------------------------------------+-------------------------------+-----------------------------+

   ③ From Ambari Web > Hosts > Components, restart the Metrics Collector

   If the cluster is configured for NameNode high availability, set hbase.rootdir to use the HDFS name service instead of the NameNode host name:

       hdfs://hdfsnameservice/apps/ams/metrics
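The three property values for distributed mode, including the choice between a NameNode host and an HA name service, can be derived programmatically. This is a minimal sketch: `ams_rootdir` and `distributed_mode_properties` are hypothetical helpers, the host name `nn1.example.com` is a placeholder, and applying the values in Ambari is still done through the Configs page.

```python
def ams_rootdir(namenode_fqdn=None, nameservice=None, port=8020, path="/apps/ams/metrics"):
    """Build the hbase.rootdir value for distributed mode.

    With NameNode HA, pass nameservice and the URI omits host and port.
    Otherwise pass the NameNode FQDN.
    """
    if nameservice:
        return f"hdfs://{nameservice}{path}"
    if namenode_fqdn:
        return f"hdfs://{namenode_fqdn}:{port}{path}"
    raise ValueError("either namenode_fqdn or nameservice is required")


def distributed_mode_properties(rootdir: str) -> dict:
    """Property values to set when switching the Collector to distributed mode."""
    return {
        "timeline.metrics.service.operation.mode": "distributed",
        "hbase.cluster.distributed": "true",
        "hbase.rootdir": rootdir,
    }


print(distributed_mode_properties(ams_rootdir(namenode_fqdn="nn1.example.com")))
print(distributed_mode_properties(ams_rootdir(nameservice="hdfsnameservice")))
```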

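   Optionally, you can migrate the existing locally stored data to HDFS before switching to distributed mode.

   Steps:

       ① Create an HDFS directory for the ams user

           su - hdfs -c 'hdfs dfs -mkdir -p /apps/ams/metrics'

       ② Stop the Metrics Collector

       ③ Copy the metrics data from the AMS local directory to the HDFS directory. The local directory is the current hbase.rootdir value, for example:

           su - hdfs -c 'hdfs dfs -copyFromLocal /var/lib/ambari-metrics-collector/hbase/* /apps/ams/metrics'

           su - hdfs -c 'hdfs dfs -chown -R ams:hadoop /apps/ams/metrics'

       ④ Switch to distributed mode

       ⑤ Restart the Metrics Collector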

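9.1.4.2 Customizing TTL Settings

AMS lets you set a Time To Live (TTL) for aggregated metrics under Ambari Metrics > Configs > Advanced ams-site. Each property name is self-explanatory and controls how long a metric is kept before it is purged. TTL values are specified in seconds.

For example, suppose you are running a single-node sandbox and want to keep no more than seven days of data in order to reduce disk space consumption. You can set every property whose name ends in .ttl to 604800 (the number of seconds in seven days).

In particular, you may want to set this value for the timeline.metrics.cluster.aggregator.daily.ttl property, which controls the TTL of daily aggregates and defaults to 2 years.

Two other properties that consume a large amount of disk space are:

   • timeline.metrics.cluster.aggregator.minute.ttl    : Controls the TTL of minute-level aggregated metrics

   • timeline.metrics.host.aggregator.ttl              : Controls the TTL of host-level precision metrics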
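The arithmetic behind the seven-day value, and the "every property ending in .ttl" rule, can be sketched as follows. The helper names are hypothetical, and the sample values for the minute and host properties are invented for illustration; only the daily value is derived from the 2-year default mentioned above (730 days).

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def ttl_seconds(days: int) -> int:
    """Convert a retention period in days to a TTL value in seconds."""
    return days * SECONDS_PER_DAY


def cap_ttls(config: dict, max_days: int) -> dict:
    """Lower every *.ttl property to at most max_days; leave other properties alone."""
    limit = ttl_seconds(max_days)
    return {
        key: str(min(int(value), limit)) if key.endswith(".ttl") else value
        for key, value in config.items()
    }


config = {
    "timeline.metrics.cluster.aggregator.daily.ttl": str(ttl_seconds(730)),  # 2 years
    "timeline.metrics.cluster.aggregator.minute.ttl": "2592000",             # illustrative value
    "timeline.metrics.host.aggregator.ttl": "86400",                         # illustrative value
    "timeline.metrics.service.operation.mode": "embedded",
}

print(ttl_seconds(7))
print(cap_ttls(config, 7))
```

Values that are already below the limit are kept as they are, so the function only ever shortens retention.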

9.1.4.3 Customizing Memory Settings

Because AMS uses multiple components (such as Apache HBase and Apache Phoenix) to store and query metrics, several tunable properties are available for tuning memory usage:

   +---------------------------+-------------------------------+-------------------------------------------------------------------+

   |        Configuration      |        Property               |       Description                                                 |

   +---------------------------+-------------------------------+-------------------------------------------------------------------+

   | Advanced ams-env           | metrics_collector_heapsize   | Heap size configuration for the Collector.                       |

   +---------------------------+-------------------------------+-------------------------------------------------------------------+

   | Advanced ams-hbase-env   | hbase_regionserver_heapsize   | Heap size configuration for the single AMS HBase Region Server.   |

   +---------------------------+-------------------------------+-------------------------------------------------------------------+

   | Advanced ams-hbase-env   | hbase_master_heapsize           | Heap size configuration for the single AMS HBase Master.           |

   +---------------------------+-------------------------------+-------------------------------------------------------------------+

   | Advanced ams-hbase-env   | regionserver_xmn_size           | Maximum value for the young generation heap size for the single   |

   |                           |                               | AMS HBase RegionServer.                                           |

   +---------------------------+-------------------------------+-------------------------------------------------------------------+

   | Advanced ams-hbase-env   | hbase_master_xmn_size           | Maximum value for the young generation heap size for the single   |

   |                           |                               | AMS HBase Master.                                                   |

   +---------------------------+-------------------------------+-------------------------------------------------------------------+

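9.1.4.4 Customizing Cluster-Environment-Specific Settings

The Metrics Collector mode, TTL settings, memory settings, and disk space requirements for AMS depend on the number of nodes in the cluster. The following table lists recommendations and tuning guidelines for each: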

       +----------------+---------+--------+-------------+-------------+-----------------------------------+

       | Cluster        | Hosts   | Disk   | Collector   | TTL         | Memory settings                   |

       | environment    |         | space  | mode        |             |                                   |

       +----------------+---------+--------+-------------+-------------+-----------------------------------+

       | Single-Node    | 1       | 2GB    | embedded    | Reduce TTLs | metrics_collector_heapsize=1024   |

       | Sandbox        |         |        |             | to 7 Days   | hbase_regionserver_heapsize=512   |

       |                |         |        |             |             | hbase_master_heapsize=512         |

       |                |         |        |             |             | hbase_master_xmn_size=128         |

       +----------------+---------+--------+-------------+-------------+-----------------------------------+

       | PoC            | 1-5     | 5GB    | embedded    | Reduce TTLs | metrics_collector_heapsize=1024   |

       |                |         |        |             | to 30 Days  | hbase_regionserver_heapsize=512   |

       |                |         |        |             |             | hbase_master_heapsize=512         |

       |                |         |        |             |             | hbase_master_xmn_size=128         |

       +----------------+---------+--------+-------------+-------------+-----------------------------------+

       | Pre-Production | 5-20    | 20GB   | embedded    | Reduce TTLs | metrics_collector_heapsize=1024   |

       |                |         |        |             | to 3 Months | hbase_regionserver_heapsize=1024  |

       |                |         |        |             |             | hbase_master_heapsize=512         |

       |                |         |        |             |             | hbase_master_xmn_size=128         |

       +----------------+---------+--------+-------------+-------------+-----------------------------------+

       | Production     | 20-50   | 50GB   | embedded    | n.a.        | metrics_collector_heapsize=1024   |

       |                |         |        |             |             | hbase_regionserver_heapsize=1024  |

       |                |         |        |             |             | hbase_master_heapsize=512         |

       |                |         |        |             |             | hbase_master_xmn_size=128         |

       +----------------+---------+--------+-------------+-------------+-----------------------------------+

       | Production     | 50-200  | 100GB  | embedded    | n.a.        | metrics_collector_heapsize=2048   |

       |                |         |        |             |             | hbase_regionserver_heapsize=2048  |

       |                |         |        |             |             | hbase_master_heapsize=2048        |

       |                |         |        |             |             | hbase_master_xmn_size=256         |

       +----------------+---------+--------+-------------+-------------+-----------------------------------+

       | Production     | 200-400 | 200GB  | embedded    | n.a.        | metrics_collector_heapsize=2048   |

       |                |         |        |             |             | hbase_regionserver_heapsize=2048  |

       |                |         |        |             |             | hbase_master_heapsize=2048        |

       |                |         |        |             |             | hbase_master_xmn_size=512         |

       +----------------+---------+--------+-------------+-------------+-----------------------------------+

       | Production     | 400-800 | 200GB  | distributed | n.a.        | metrics_collector_heapsize=8192   |

       |                |         |        |             |             | hbase_regionserver_heapsize=12288 |

       |                |         |        |             |             | hbase_master_heapsize=1024        |

       |                |         |        |             |             | hbase_master_xmn_size=1024        |

       |                |         |        |             |             | regionserver_xmn_size=1024        |

       +----------------+---------+--------+-------------+-------------+-----------------------------------+

       | Production     | 800+    | 500GB  | distributed | n.a.        | metrics_collector_heapsize=12288  |

       |                |         |        |             |             | hbase_regionserver_heapsize=16384 |

       |                |         |        |             |             | hbase_master_heapsize=16384       |

       |                |         |        |             |             | hbase_master_xmn_size=2048        |

       |                |         |        |             |             | regionserver_xmn_size=1024        |

       +----------------+---------+--------+-------------+-------------+-----------------------------------+
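The table can be turned into a small lookup that returns the recommended disk space, Collector mode, and TTL guidance for a given host count. This is only a sketch: `Sizing`, `SIZING_TABLE`, and `recommend` are hypothetical names, memory settings are left out for brevity, and because adjacent ranges in the table share their boundary value, the sketch assumes each range includes its upper bound.

```python
from typing import NamedTuple, Optional


class Sizing(NamedTuple):
    max_hosts: Optional[int]   # inclusive upper bound; None means no upper bound
    disk_space: str
    collector_mode: str
    ttl_hint: Optional[str]    # None where the table says "n.a."


# One row per line of the table, ordered by cluster size.
SIZING_TABLE = [
    Sizing(1, "2GB", "embedded", "Reduce TTLs to 7 Days"),
    Sizing(5, "5GB", "embedded", "Reduce TTLs to 30 Days"),
    Sizing(20, "20GB", "embedded", "Reduce TTLs to 3 Months"),
    Sizing(50, "50GB", "embedded", None),
    Sizing(200, "100GB", "embedded", None),
    Sizing(400, "200GB", "embedded", None),
    Sizing(800, "200GB", "distributed", None),
    Sizing(None, "500GB", "distributed", None),
]


def recommend(host_count: int) -> Sizing:
    """Return the first table row whose host range covers host_count."""
    if host_count < 1:
        raise ValueError("host_count must be at least 1")
    for row in SIZING_TABLE:
        if row.max_hosts is None or host_count <= row.max_hosts:
            return row
    raise AssertionError("unreachable: the last row has no upper bound")


for hosts in (1, 30, 500, 1000):
    print(hosts, recommend(hosts))
```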

9.1.4.5 Moving the Metrics Collector

Use the following procedure to move the Ambari Metrics Collector to a new host:

   ① In Ambari Web, stop the Ambari Metrics service

   ② Run the following API call to delete the current Metrics Collector component (here metrics.collector.hostname is the current Collector host):

       curl -u admin:admin -H "X-Requested-By:ambari" -i -X DELETE \
           http://ambari.server:8080/api/v1/clusters/cluster.name/hosts/metrics.collector.hostname/host_components/METRICS_COLLECTOR

   ③ Run the following API call to add the Metrics Collector to the new host (here metrics.collector.hostname is the new Collector host):

       curl -u admin:admin -H "X-Requested-By:ambari" -i -X POST \
           http://ambari.server:8080/api/v1/clusters/cluster.name/hosts/metrics.collector.hostname/host_components/METRICS_COLLECTOR

   ④ In Ambari Web, navigate to the host on which the new Metrics Collector was added and click Install the Metrics Collector

   ⑤ In Ambari Web, start the Ambari Metrics service
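The two API calls differ only in the HTTP method and the host name, which a short script makes explicit. The sketch below builds the requests with Python's standard urllib but does not send them; `collector_request` and the host names `old-host.example.com` and `new-host.example.com` are placeholders, and the URL layout and header are taken from the curl commands above.

```python
import base64
import urllib.request


def collector_request(method, ambari_url, cluster, host, user, password):
    """Build (but do not send) the request that removes or adds METRICS_COLLECTOR on a host."""
    url = (
        f"{ambari_url}/api/v1/clusters/{cluster}"
        f"/hosts/{host}/host_components/METRICS_COLLECTOR"
    )
    req = urllib.request.Request(url, method=method)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")  # same credentials as curl -u
    req.add_header("X-Requested-By", "ambari")         # same header as the curl calls
    return req


base = "http://ambari.server:8080"
delete_old = collector_request("DELETE", base, "cluster.name", "old-host.example.com", "admin", "admin")
add_new = collector_request("POST", base, "cluster.name", "new-host.example.com", "admin", "admin")

for req in (delete_old, add_new):
    print(req.get_method(), req.full_url)

# To actually send a request, stop the Ambari Metrics service first, then call:
# urllib.request.urlopen(delete_old)
```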

9.1.4.6 (Optional) Enabling Individual Region, Table, and User Metrics for HBase

Unlike HBase RegionServer metrics, per-region, per-table, and per-user metrics are disabled by default in Ambari, because these metrics are so numerous that they can cause performance problems.

If you want Ambari to collect these metrics, you can re-enable them. However, test this option first and confirm that AMS performance remains acceptable.

   ① On the Ambari Server, browse to the following location:

       /var/lib/ambari-server/resources/common-services/HBASE/0.96.0.2.0/package/templates

   ② Edit the following template files:

       hadoop-metrics2-hbase.properties-GANGLIA-MASTER.j2

       hadoop-metrics2-hbase.properties-GANGLIA-RS.j2

   ③ Comment out or remove the following lines

       *.source.filter.class=org.apache.hadoop.metrics2.filter.RegexFilter

       hbase.*.source.filter.exclude=.*(Regions|Users|Tables).*

   ④ Save the template files and restart Ambari Server for the changes to take effect.

   Important:

       If Ambari is upgraded to a new version, you must apply these changes to the template files again
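Because the change has to be reapplied after every Ambari upgrade, it can be convenient to script it. The following sketch comments out the two filter lines in a template's text; `enable_detail_metrics` is a hypothetical helper that works on a string, the sample template content is invented, and reading or writing the actual .j2 files is left to the caller.

```python
# Prefixes of the two lines that exclude per-region, per-table, and per-user metrics.
FILTER_PREFIXES = (
    "*.source.filter.class=",
    "hbase.*.source.filter.exclude=",
)


def enable_detail_metrics(template_text: str) -> str:
    """Comment out the metrics filter lines; lines that are already commented are left alone."""
    out = []
    for line in template_text.splitlines():
        if line.strip().startswith(FILTER_PREFIXES):
            out.append("#" + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


sample = (
    "*.period=10\n"  # illustrative unrelated line
    "*.source.filter.class=org.apache.hadoop.metrics2.filter.RegexFilter\n"
    "hbase.*.source.filter.exclude=.*(Regions|Users|Tables).*\n"
)
print(enable_detail_metrics(sample))
```

Running the function on its own output changes nothing, so it is safe to rerun after an upgrade even if the lines were already commented out.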

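9.1.5 AMS High Availability

By default, Ambari installs the Ambari Metrics System (AMS) into the cluster with a single Metrics Collector component. The Collector is a daemon that runs on a specific host in the cluster and receives data from the registered publishers, that is, the Monitors and Sinks.

Depending on your needs, you may require AMS to have two Collectors to cover a high-availability scenario.

Prerequisite:

   AMS must be deployed in distributed (not embedded) mode

Steps:

   ① In Ambari Web, browse to the host on which you want to install another Collector

   ② On the Hosts page, choose +Add

   ③ Select Metrics Collector from the list

       Ambari installs the new Metrics Collector and configures Ambari Metrics for HA

       The newly installed Collector is in the "stopped" state

   ④ In Ambari Web, start the new Collector component

   Note:

   If AMS has not been switched to distributed mode before the second Collector is installed in the cluster, the second Collector is installed but does not start.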

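9.1.6 AMS Security

9.1.6.1 Changing the Grafana Admin Password

If you need to change the Grafana admin password after the initial Ambari installation, change the password directly in Grafana, then make the same change in the Ambari Metrics configuration.

   (1) In Ambari Web, browse to Services > Ambari Metrics, select Quick Links, then choose Grafana

       The Grafana UI opens in read-only mode

   (2) Click Sign In

   (3) Log in as the administrator, using the unchanged default credentials admin/admin

   (4) Click the admin label to view the administrator details, then click Change password

   (5) Enter the current (unchanged) password, enter and confirm the new password, then click the Change password button

   (6) Return to Ambari Web > Services > Ambari Metrics and browse to the Configs tab

   (7) In the General section, update and confirm Grafana Admin Password with the new password

   (8) Save the configuration and restart the services, if prompted.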

9.1.6.2 为 AMS 设置 HTTPS (Set Up HTTPS for AMS)

如果要限制访问 AMS 通过 HTTPS 连接，必须提供一个证书。起初测试的时候可以使用自签名的证书，但不适用于生产环境。在获得了一个证书之后，必须

运行特定的安装命令(setup command)。

步骤：

   (1)   创建自己的 CA 证书(CA certificate)

       openssl req -new -x509 -keyout ca.key -out ca.crt -days 365

   (2)   导入 CA 证书到信任站 (truststore)

       # keytool -keystore /<path>/truststore.jks -alias CARoot -import -file ca.crt -storepass bigdata

   (3) 检查 truststore

       # keytool -keystore /<path>/truststore.jks -list

       Enter keystore password:

       Keystore type: JKS

       Keystore provider: SUN

       Your keystore contains 2 entries

       caroot, Feb 22, 2016, trustedCertEntry,

       Certificate fingerprint (SHA1):

       AD:EE:A5:BC:A8:FA:61:2F:4D:B3:53:3D:29:23:58:AB:2E:B1:82:AF

   (4) Generate a certificate for the AMS Collector and store the private key in the keystore.

       # keytool -genkey -alias c6401.ambari.apache.org -keyalg RSA -keysize 1024 -dname "CN=c6401.ambari.apache.org,OU=IT,O=Apache,L=US,ST=US,C=US" -keypass bigdata -keystore /<path>/keystore.jks -storepass bigdata

   (5) Create a certificate request for the AMS Collector certificate.

       keytool -keystore /<path>/keystore.jks -alias c6401.ambari.apache.org -certreq -file c6401.ambari.apache.org.csr -storepass bigdata

   (6) Sign the certificate request with the CA certificate.

       openssl x509 -req -CA ca.crt -CAkey ca.key -in c6401.ambari.apache.org.csr -out c6401.ambari.apache.org_signed.crt -days 365 -CAcreateserial -passin pass:bigdata

   (7) Import the CA certificate into the keystore.

       keytool -keystore /<path>/keystore.jks -alias CARoot -import -file ca.crt -storepass bigdata

   (8) Import the signed certificate into the keystore.

       keytool -keystore /<path>/keystore.jks -alias c6401.ambari.apache.org -import -file c6401.ambari.apache.org_signed.crt -storepass bigdata

   (9) Check the keystore.

       caroot2, Feb 22, 2016, trustedCertEntry,

       Certificate fingerprint (SHA1):

       7C:B7:0C:27:8E:0D:31:E7:BE:F8:BE:A1:A4:1E:81:22:FC:E5:37:D7

       [root@c6401 tmp]# keytool -keystore /tmp/keystore.jks -list

       Enter keystore password:

       Keystore type: JKS

       Keystore provider: SUN

       Your keystore contains 2 entries

       caroot, Feb 22, 2016, trustedCertEntry,

       Certificate fingerprint (SHA1):

       AD:EE:A5:BC:A8:FA:61:2F:4D:B3:53:3D:29:23:58:AB:2E:B1:82:AF

       c6401.ambari.apache.org, Feb 22, 2016, PrivateKeyEntry,

       Certificate fingerprint (SHA1):

       A2:F9:BE:56:7A:7A:8B:4C:5E:A6:63:60:B7:70:50:43:34:14:EE:AF

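   (10) Copy the /<path>/truststore.jks file to /<path>/truststore.jks on all nodes and set appropriate access permissions.

   (11) Copy the /<path>/keystore.jks file to /<path>/keystore.jks on the AMS Collector node only, and set appropriate access permissions. Making the ams user the file owner and setting the access permissions to 400 is recommended.

   (12) In Ambari Web, update the AMS configuration with the following settings:

       •ams-site/timeline.metrics.service.http.policy=HTTPS_ONLY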

       •ams-ssl-server/ssl.server.keystore.keypassword=bigdata

       •ams-ssl-server/ssl.server.keystore.location=/<path>/keystore.jks

       •ams-ssl-server/ssl.server.keystore.password=bigdata

       •ams-ssl-server/ssl.server.keystore.type=jks

       •ams-ssl-server/ssl.server.truststore.location=/<path>/truststore.jks

       •ams-ssl-server/ssl.server.truststore.password=bigdata

       •ams-ssl-server/ssl.server.truststore.reload.interval=10000

       •ams-ssl-server/ssl.server.truststore.type=jks

       •ams-ssl-client/ssl.client.truststore.location=/<path>/truststore.jks

       •ams-ssl-client/ssl.client.truststore.password=bigdata

       •ams-ssl-client/ssl.client.truststore.type=jks

       •ssl.client.truststore.alias=<Alias used to create certificate for AMS. (Default is hostname)>

   (13) Restart the services.

   (14) Configure Ambari Server to use the truststore.

       # ambari-server setup-security

       Using python /usr/bin/python

       Security setup options...

       ===========================================================================

       Choose one of the following options:

       [1] Enable HTTPS for Ambari server.

       [2] Encrypt passwords stored in ambari.properties file.

       [3] Setup Ambari kerberos JAAS configuration.

       [4] Setup truststore.

       [5] Import certificate to truststore.

       ===========================================================================

       Enter choice, (1-5): 4

       Do you want to configure a truststore [y/n] (y)?

       TrustStore type [jks/jceks/pkcs12] (jks):jks

       Path to TrustStore file :/<path>/keystore.jks

       Password for TrustStore:

       Re-enter password:

       Ambari Server 'setup-security' completed successfully.

   (15) Configure Ambari Server to use https instead of http for requests to the AMS Collector:

       # echo "server.timeline.metrics.https.enabled=true" >> /etc/ambari-server/conf/ambari.properties

   (16) Restart Ambari Server.
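
Note that appending with echo in step (15) adds a second copy of the line each time the step is repeated. The following is a minimal sketch of an idempotent alternative, assuming a simple key=value file layout; the helper name set_property is hypothetical and not part of Ambari, and it operates on the file contents only, so reading and writing ambari.properties is left to the caller.

```python
def set_property(properties_text, key, value):
    """Return properties_text with key set to value.

    Every existing, uncommented assignment of key is rewritten in place;
    if there is none, a new line is appended at the end.
    """
    lines = properties_text.splitlines()
    new_line = f"{key}={value}"
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # leave comments untouched
        if stripped.split("=", 1)[0].strip() == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = "some.existing.key=value\n"
    text = set_property(text, "server.timeline.metrics.https.enabled", "true")
    # Applying the same change again leaves the text unchanged.
    text = set_property(text, "server.timeline.metrics.https.enabled", "true")
    print(text)
```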

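9.1.6.3 Set Up HTTPS for Grafana

To restrict access to Grafana to HTTPS connections, you must provide a certificate. A self-signed certificate can be used for initial testing, but it is not suitable for production. After you have obtained a certificate, you must run a specific setup command.

Steps:

   (1) Log in to the Grafana host.

   (2) Browse to the Grafana configuration directory.

       cd /etc/ambari-metrics-grafana/conf/

   (3) Locate your certificate.

       To create a temporary self-signed certificate, you can run:

       openssl genrsa -out ams-grafana.key 2048

       openssl req -new -key ams-grafana.key -out ams-grafana.csr

       openssl x509 -req -days 365 -in ams-grafana.csr -signkey ams-grafana.key -out ams-grafana.crt

   (4) Set the owner and permissions of the certificate and key files so that Grafana can access them.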

       chown ams:hadoop ams-grafana.crt

       chown ams:hadoop ams-grafana.key

       chmod 400 ams-grafana.crt

       chmod 400 ams-grafana.key

       For a non-root Ambari user, use:

       chmod 444 ams-grafana.crt

       so that the agent user can read the file.

   (5) In Ambari Web, browse to Services > Ambari Metrics > Configs.

   (6) In the Advanced ams-grafana-ini section, update the following properties:

       protocol https

       cert_file /etc/ambari-metrics-grafana/conf/ams-grafana.crt

       cert_key /etc/ambari-metrics-grafana/conf/ams-grafana.key

   (7) Save the configuration and restart the services if prompted.
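
Before restarting, it can help to confirm that the file modes from step (4) were actually applied. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the accepted modes (400 for the key, 400 or 444 for the certificate) are taken from step (4) above. Ownership is not checked here.

```python
import os
import stat


def file_mode(path):
    """Return the permission bits of path, for example 0o400."""
    return stat.S_IMODE(os.stat(path).st_mode)


def check_tls_file_modes(cert_path, key_path):
    """Return a list of problems; an empty list means the modes match step (4)."""
    problems = []
    key_mode = file_mode(key_path)
    if key_mode != 0o400:
        problems.append(f"{key_path}: mode {key_mode:o}, expected 400")
    cert_mode = file_mode(cert_path)
    # 400 is the default from step (4); 444 is the non-root Ambari user variant.
    if cert_mode not in (0o400, 0o444):
        problems.append(f"{cert_path}: mode {cert_mode:o}, expected 400 or 444")
    return problems


if __name__ == "__main__":
    conf_dir = "/etc/ambari-metrics-grafana/conf"
    for problem in check_tls_file_modes(
        os.path.join(conf_dir, "ams-grafana.crt"),
        os.path.join(conf_dir, "ams-grafana.key"),
    ):
        print(problem)
```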

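9.2 Ambari Log Search (Technical Preview)

The following sections describe the Technical Preview release of Ambari Log Search, which should be used only on non-production clusters with fewer than 150 nodes.

9.2.1 Log Search Architecture

Ambari Log Search lets you search the logs generated by Ambari-managed HDP components. It relies on the Ambari Infra service, which provides Apache Solr indexing services.

Two components make up the Log Search solution:

   • Log Feeder

   • Log Search Server

9.2.1.1 Log Feeder

The Log Feeder component parses component logs. A Log Feeder is deployed on every node in the cluster and interacts with all component logs on that node. On startup, the Log Feeder begins parsing all known component logs and sends them to the Apache Solr instances (managed by the Ambari Infra service) to be indexed.

By default, only FATAL, ERROR, and WARN logs are captured by the Log Feeder. You can use the Log Search UI filter settings to add other log levels temporarily or permanently.

9.2.1.2 Log Search Server

The Log Search Server hosts the Log Search UI web application and provides the API used by Ambari and by the Log Search UI to access the indexed component logs. After logging in as a local or LDAP user, you can use the Log Search UI to visualize, browse, and search the indexed component logs.

9.2.2 Installing Log Search

Log Search is a built-in service in Ambari 2.4 and later. You can add it during a new installation by using the +Add Service menu. Log Feeders are installed automatically on all nodes in the cluster. You place the Log Search Server manually; it can go on the same host as Ambari Server.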

9.2.3 Using Log Search

Using Log Search includes the following activities:

   • Accessing Log Search

   • Using Log Search to Troubleshoot

   • Viewing Service Logs

   • Viewing Access Logs

9.2.3.1 Accessing Log Search

After Log Search is installed, you can search the indexed logs in the following three ways:

   • Ambari Background Ops Log Search Link

   • Host Detail Logs Tab

   • Log Search UI

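9.2.3.1.1 Ambari Background Ops Log Search Link

When you perform lifecycle operations such as starting or stopping services, access to the logs is important for recovering from potential failures. These logs are now available in Background Ops. Background Ops also links to the Host Detail Logs tab, which lists all indexed log files that can be viewed for a host.

9.2.3.1.2 Host Detail Logs Tab

A Logs tab is added to each host's detail page. It contains a list of indexed, viewable log files, organized by service, component, and type. You can open and search each of these files through a link to the Log Search UI.

9.2.3.1.3 Log Search UI

The Log Search UI is a purpose-built web application for searching HDP component logs. It focuses on fast access to, and search of, logs from a single location. Logs can be filtered by log level, component, and searchable keywords.

The Log Search UI is available from the Quick Links of the Log Search service in Ambari Web.

9.2.3.2 Using Log Search to Troubleshoot

To find the logs related to a specific problem, use the Troubleshooting tab in the UI and select the service, components, and time frame related to the problem. For example, if you select HDFS, the UI automatically searches the HDFS-related components. You can select a time frame of yesterday or last week, or a custom value. When you are ready to view the matching logs, click Go to Logs.

9.2.3.3 Viewing Service Logs

The Service Logs tab lets you search across all component logs by keyword, or filter by log level, component, and time range. The UI is organized so that you can quickly see how many logs were captured at each log level, find keywords, include or exclude components, and view the logs that match your query.

9.2.3.4 Viewing Access Logs

When troubleshooting HDFS-related issues, it can be helpful to look at HDFS user access trends. The Access Logs tab shows the HDFS audit logs, with aggregated views of the top ten HDFS users and the top ten file system resources accessed. This can help you find anomalies, or hot and cold data sets.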

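9.3 Ambari Infra

Many services in HDP depend on core services to index data. For example, Apache Atlas uses indexing services for lineage-free text search, and Apache Ranger indexes its audit data. The role of Ambari Infra is to provide a common indexing service for the components of the installed stack.

Currently, the Ambari Infra service has only one component: the Infra Solr Instance. The Infra Solr Instance is a fully managed Apache Solr installation. By default, when you choose to install the Ambari Infra service, a single-node SolrCloud installation is deployed, but you can install multiple Infra Solr Instances to get a distributed index and search for Atlas, Ranger, and Log Search.

To install multiple Infra Solr Instances, simply add them to existing cluster hosts through Ambari's +Add Service capability. The number of Infra Solr Instances to deploy depends on the number of nodes in the cluster and on the services deployed.

Because one Ambari Infra Solr Instance is used by multiple HDP components, be careful when restarting the service so that you do not disrupt the dependent services. In HDP 2.5 and later, Atlas, Ranger, and Log Search depend on the Ambari Infra Solr Instance.

   Note:

       The Infra Solr Instance is intended for use by HDP components only; third-party components and applications are not supported.

9.3.1 Archiving & Purging Data

Large clusters produce a lot of log content, and Ambari Infra provides a convenience utility for archiving and purging logs that are no longer needed.

The utility is called the Solr Data Manager. The Solr Data Manager is a Python program installed at /usr/bin/infra-solr-data-manager. It lets you quickly archive, delete, or save data from a Solr collection.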

9.3.1.1 Command Line Options

● Operation Modes

   -m MODE, --mode=MODE archive | delete | save

   The mode to use depends on the operation you want to perform:

   archive   : stores data on the storage medium and deletes it from Solr once it has been stored

   delete   : deletes data

   save   : like archive, except that the data is not deleted after it has been saved

   ● Connecting to Solr

   -s SOLR_URL, --solr-url=<SOLR_URL>

   The URL used to connect to a specific Solr Cloud instance.

   For example, http://c6401.ambari.apache.org:8886/solr

   -c COLLECTION, --collection=COLLECTION

   The name of the Solr collection, for example 'hadoop_logs'.

   -k SOLR_KEYTAB,--solr-keytab=SOLR_KEYTAB

   The keytab file to use with a kerberized Solr instance.

   -n SOLR_PRINCIPAL, --solr-principal=SOLR_PRINCIPAL

   The principal name to use with a kerberized Solr instance.

   ● Record Schema

   -i ID_FIELD, --id-field=ID_FIELD

   The name of the field in the Solr schema that uniquely identifies each record.

   -f FILTER_FIELD, --filter-field=FILTER_FIELD

   The name of the field in the Solr schema to filter on, for example 'logtime'.

   -o DATE_FORMAT, --date-format=DATE_FORMAT

   The custom date format to use with the -d DAYS field to match log entries that are older than a certain number of days.

   -e END

   Based on the filter field and date format, this argument configures the date that should be used as the end of the date range. If you

   use '2018-08-29T12:00:00.000Z', then any records with a filter field that is before that date will be saved, deleted, or archived

   depending on the mode.

   -d DAYS, --days=DAYS

   Based on the filter field and date format, this argument configures the number of days before today that should be used as the end of the range.

   If you use '30', then any records with a filter field that is older than 30 days will be saved, deleted, or archived depending on the mode.

   -q ADDITIONAL_FILTER, --additional-filter=ADDITIONAL_FILTER

   Any additional filter criteria to use to match records in the collection

   ● Extracting Records

   -r READ_BLOCK_SIZE, --read-block-size=READ_BLOCK_SIZE

   The number of records to read at a time from Solr. For example: '10' to read 10 records at a time.

   -w WRITE_BLOCK_SIZE, --write-block-size=WRITE_BLOCK_SIZE

   The number of records to write per output file. For example: '100' to write 100 records per file.

   -j NAME, --name=NAME name included in result files

   Additional name to add to the final filename created in save or archive mode.

   --json-file

   Default output format is one valid json document per record delimited by a newline. This option will write out a single valid JSON

   document containing all of the records.

   -z COMPRESSION, --compression=COMPRESSION none | tar.gz | tar.bz2 | zip | gz

   Depending on how output files will be analyzed, you have the choice to choose the optimal compression and file format to use for output

   files. Gzip compression is used by default.

   ● Writing Data to HDFS

   -a HDFS_KEYTAB, --hdfs-keytab=HDFS_KEYTAB

   The keytab file to use when writing data to a kerberized HDFS instance.

   -l HDFS_PRINCIPAL, --hdfs-principal=HDFS_PRINCIPAL

   The principal name to use when writing data to a kerberized HDFS instance

   -u HDFS_USER, --hdfs-user=HDFS_USER

   The user to connect to HDFS as

   -p HDFS_PATH, --hdfs-path=HDFS_PATH

   The path in HDFS to write data to in save or archive mode.

   ● Writing Data to S3

   -t KEY_FILE_PATH, --key-file-path=KEY_FILE_PATH

   The path to the file on the local file system that contains the AWS Access and Secret Keys. The file should contain the keys in this

   format: <accessKey>,<secretKey>

   -b BUCKET, --bucket=BUCKET

   The name of the bucket that data should be uploaded to in save or archive mode.

   -y KEY_PREFIX, --key-prefix=KEY_PREFIX

   The key prefix allows you to create a logical grouping of the objects in an S3 bucket. The prefix value is similar to a directory name

   enabling you to store data in the same directory in a bucket. For example, if your Amazon S3 bucket name is logs, and you set prefix

   to hadoop/, and the file on your storage device is hadoop_logs_-_2017-10-28T01_25_40.693Z.json.gz, then the file would be identified

   by this URL: http://s3.amazonaws.com/logs/hadoop/hadoop_logs_-_2017-10-28T01_25_40.693Z.json.gz

   -g, --ignore-unfinished-uploading

   To deal with connectivity issues, uploading extracted data can be retried. If you do not wish to resume uploads, use the -g flag to

   disable this behaviour.

   ● Writing Data Locally

   -x LOCAL_PATH, --local-path=LOCAL_PATH

   The path on the local file system that should be used to write data to in save or archive mode

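   ● Examples

   □ Deleting Indexed Data

   In delete mode (-m delete), the program deletes data from the Solr collection. This mode uses the filter field (-f FILTER_FIELD) option to control which data is removed from the index.

   The following command deletes log entries from the hadoop_logs collection that were created before August 29, 2017. The -f option specifies the field in the Solr collection to use as the filter field, and the -e option marks the end of the range to delete.

   infra-solr-data-manager -m delete -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -e 2017-08-29T12:00:00.000Z

   □ Archiving Indexed Data

   In archive mode, the program fetches data from the Solr collection, writes it out to HDFS or S3, and then deletes the data.

   The program fetches records from Solr and creates a file once the write block size is reached or no more matching records remain in Solr. It tracks its progress by fetching records ordered by the filter field and the id field, and it always saves their last values. Once a file is written, it is compressed using the configured compression type.

   After the compressed file is created, the program writes a command file containing instructions for the next steps. If any interruption or error occurs during those steps, the next run starts by executing the saved command file, so that all data stays consistent. If the error was caused by an invalid configuration and the failures persist, the -g option can be used to ignore the saved command file. The program supports writing data to HDFS, S3, or local files.

   The following command archives data from the Solr collection hadoop_logs at http://c6401.ambari.apache.org:8886/solr, based on the logtime field. It extracts everything older than 1 day, reads 10 documents at a time, writes 100 documents to each file, and copies the compressed files to the local /tmp directory.

   infra-solr-data-manager -m archive -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -d 1 -r 10 -w 100 -x /tmp -v

   □ Saving Indexed Data

   Saving data is like archiving it, except that the data is not deleted from Solr after the files are created and uploaded. Running save mode first is recommended, to test that the data is written as expected before you run archive mode.

   The following command saves the HDFS audit logs matched by -d 3 (records older than 3 days, per the -d option described above) to the HDFS path "/" as the hdfs user, fetching the data from a kerberized Solr.

   infra-solr-data-manager -m save -s http://c6401.ambari.apache.org:8886/solr -c audit_logs -f logtime -d 3 -r 10 -w 100 -q type:\"hdfs_audit\" -j hdfs_audit -k /etc/security/keytabs/ambari-infra-solr.service.keytab -n infra-solr/c6401.ambari.apache.org@AMBARI.APACHE.ORG -u hdfs -p /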
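
The relationship between a number of days (-d) and an explicit end date (-e) can be sketched in a few lines. The following is an illustration only, under stated assumptions: the helper names cutoff_timestamp and build_command are hypothetical, the timestamp layout simply mirrors the example value used with -e above, and how the tool itself computes its cutoff is not shown in the source. The command is assembled as an argument list and printed, not executed.

```python
from datetime import datetime, timedelta, timezone


def cutoff_timestamp(days, now):
    """Format the instant `days` before `now` like the -e example value.

    `now` must be a timezone-aware datetime; it is converted to UTC first.
    """
    cutoff = now.astimezone(timezone.utc) - timedelta(days=days)
    millis = cutoff.microsecond // 1000
    return cutoff.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"


def build_command(mode, solr_url, collection, filter_field, end):
    """Assemble an argument list using only options documented above."""
    if mode not in ("archive", "delete", "save"):
        raise ValueError(f"unknown mode: {mode}")
    return [
        "infra-solr-data-manager",
        "-m", mode,
        "-s", solr_url,
        "-c", collection,
        "-f", filter_field,
        "-e", end,
    ]


if __name__ == "__main__":
    end = cutoff_timestamp(30, datetime.now(timezone.utc))
    print(" ".join(build_command(
        "save", "http://c6401.ambari.apache.org:8886/solr",
        "hadoop_logs", "logtime", end)))
```

Building the command with save mode first, as recommended above, lets you inspect the written files before switching the mode to archive or delete.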

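9.3.2 Performance Tuning for Ambari Infra

When you use Ambari Infra to index and store Ranger audit logs, Solr should be tuned to handle the number of audit records stored per day. The following sections give recommendations for tuning the operating system and Solr, based on how you use Ambari Infra and Ranger in your environment.

9.3.2.1 Operating System Tuning

Solr uses many network connections when indexing and searching. To avoid keeping too many network connections open, the following sysctl parameters are recommended: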

   net.ipv4.tcp_max_tw_buckets = 1440000

   net.ipv4.tcp_tw_recycle = 1

   net.ipv4.tcp_tw_reuse = 1

These settings can be made permanent in the /etc/sysctl.d/net.conf file, or applied at runtime with the following sysctl commands:

   sysctl -w net.ipv4.tcp_max_tw_buckets=1440000

   sysctl -w net.ipv4.tcp_tw_recycle=1

   sysctl -w net.ipv4.tcp_tw_reuse=1

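In addition, raise the number of user processes allowed for Solr, to avoid exceptions caused by being unable to create new native threads. To do so, create a new file named /etc/security/limits.d/infra-solr.conf with the following content: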

   infra-solr - nproc 6000

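9.3.2.2 JVM - GC Settings

Heap size and garbage collection settings are very important for production Solr instances that index many Ranger audit logs. For production deployments, setting both "Infra Solr Minimum Heap Size" and "Infra Solr Maximum Heap Size" to 12 GB is recommended. To apply these settings:

   ① In Ambari Web, browse to Services > Ambari Infra > Configs.

   ② On the Settings tab, two sliders control the Infra Solr heap size.

   ③ Set Infra Solr Minimum Heap Size to 12GB or 12,288MB.

   ④ Set Infra Solr Maximum Heap Size to 12GB or 12,288MB.

   ⑤ Click Save to save the configuration, then restart the affected services as prompted by Ambari.

   Using G1 as the garbage collector is also recommended for production deployments. To set up G1 garbage collection for the Ambari Infra Solr Instance:

   ① In Ambari Web, browse to Services > Ambari Infra > Configs.

   ② On the Advanced tab, expand Advanced infra-solr-env.

   ③ In the infra-solr-env template, locate the multi-line GC_TUNE environment variable definition and replace it with the following: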

       GC_TUNE="-XX:+UseG1GC

           -XX:+PerfDisableSharedMem

           -XX:+ParallelRefProcEnabled

           -XX:G1HeapRegionSize=4m

           -XX:MaxGCPauseMillis=250

           -XX:InitiatingHeapOccupancyPercent=75

           -XX:+UseLargePages

           -XX:+AggressiveOpts"

   The value used for -XX:G1HeapRegionSize is based on a 12 GB Solr Maximum Heap Size. If you choose a different heap size for Solr, refer to the following table:



           +-----------------------+---------------------------+

           | Heap Size               |   G1HeapRegionSize       |

           +-----------------------+---------------------------+

           | < 4GB                   | 1MB                       |

           +-----------------------+---------------------------+

           | 4-8GB                   | 2MB                       |

           +-----------------------+---------------------------+

           | 8-16GB               | 4MB                       |

           +-----------------------+---------------------------+

           | 16-32GB               | 8MB                       |

           +-----------------------+---------------------------+

           | 32-64GB               | 16MB                       |

           +-----------------------+---------------------------+

           | >64GB                   | 32MB                       |

           +-----------------------+---------------------------+
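
The table can be expressed as a small lookup. This is a minimal sketch with a hypothetical function name; since the ranges in the table share their end points (for example 4-8GB and 8-16GB), the choice to assign each shared boundary to the larger range is an assumption made here, not something the table states.

```python
def g1_heap_region_size_mb(heap_gb):
    """Return the G1HeapRegionSize in MB suggested by the table above.

    Assumption: a heap size that sits exactly on a shared boundary
    (4, 8, 16, 32 GB) is assigned to the larger range.
    """
    if heap_gb < 4:
        return 1
    if heap_gb < 8:
        return 2
    if heap_gb < 16:
        return 4
    if heap_gb < 32:
        return 8
    if heap_gb <= 64:
        return 16
    return 32


if __name__ == "__main__":
    # The recommended 12 GB heap falls in the 8-16GB row.
    size = g1_heap_region_size_mb(12)
    print(f"-XX:G1HeapRegionSize={size}m")
```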

9.3.2.3 Environment-Specific Tuning Parameters

Each of the following recommendations depends on the number of audit records indexed per day. To determine that number quickly, use an HTTP client such as curl to run the following command:

   curl -g "http://<ambari infra hostname>:8886/solr/ranger_audits/select?q=(evtTime:[NOW-7DAYS+TO+*])&wt=json&indent=true&rows=0"

You should receive a response similar to the following:

   {

       "responseHeader":{

       "status":0,

       "QTime":1,

       "params":{

       "q":"evtTime:[NOW-7DAYS TO *]",

       "indent":"true",

       "rows":"0",

       "wt":"json"}},

       "response":{"numFound":306,"start":0,"docs":[]

   }}

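   Divide the numFound value in the response by 7 to get the average number of audit records indexed per day. If necessary, you can replace '7DAYS' in the curl request with a broader time range, using keywords such as:

       • 1MONTHS

       • 7DAYS

   If you change the time range of the query, make sure you divide by the matching number of days. The average number of records per day determines which of the following recommendations applies to your environment.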
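
   The division described above can be sketched as follows. This is a minimal illustration with a hypothetical function name; it parses a response that has already been fetched (for example with the curl command above) and reads the numFound field shown in the sample output. The abbreviated sample response is included only so the sketch runs on its own.

```python
import json


def average_daily_records(response_text, days):
    """Return numFound from a Solr select response divided by `days`."""
    body = json.loads(response_text)
    return body["response"]["numFound"] / days


if __name__ == "__main__":
    # Abbreviated form of the sample response shown above.
    sample = (
        '{"responseHeader": {"status": 0, "QTime": 1},'
        ' "response": {"numFound": 306, "start": 0, "docs": []}}'
    )
    # 306 records over 7 days is roughly 43.7 records per day.
    print(average_daily_records(sample, 7))
```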

   ● Less Than 50 Million Audit Records Per Day

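   Based on the Solr REST API call above, apply the following recommendations if the average number of records per day is below 50 million. Each recommendation takes into account the time to live (TTL), which controls how long a document is kept in the index before it is removed. The default TTL is 90 days, but some users choose a more aggressive 30 days, so recommendations are given for both TTL settings.

   These recommendations assume the suggested 12 GB heap size for each Solr server instance.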

   Default Time To Live (TTL) 90 days:

   • Estimated total index size: ~150 GB to 450 GB

   • Total number of primary/leader shards: 6

   • Total number of shards including 1 replica each: 12

   • Total number of co-located Solr nodes: ~3 nodes, up to 2 shards per node(does not include replicas)

   • Total number of dedicated Solr nodes: ~1 node, up to 12 shards per node(does not include replicas)

   ● 50 - 100 Million Audit Records Per Day

   50 to 100 million records ~ 5 - 10 GB data per day.

   Default Time To Live (TTL) 90 days:

   • Estimated total index size: ~ 450 - 900 GB for 90 days

   • Total number of primary/leader shards: 18-36

   • Total number of shards including 1 replica each: 36-72

   • Total number of co-located Solr nodes: ~9-18 nodes, up to 2 shards per node(does not include replicas)

   • Total number of dedicated Solr nodes: ~3-6 nodes, up   to 12 shards per node(does not include replicas)

   Custom Time To Live (TTL) 30 days:

   • Estimated total index size: 150 - 300 GB for 30 days

   • Total number of primary/leader shards: 6-12

   • Total number of shards including 1 replica each: 12-24

   • Total number of co-located Solr nodes: ~3-6 nodes, up to 2 shards per node(does not include replicas)

   • Total number of dedicated Solr nodes: ~1-2 nodes, up to 12 shards per node(does not include replicas)

   ● 100 - 200 Million Audit Records Per Day

   100 to 200 million records ~ 10 - 20 GB data per day.

   Default Time To Live (TTL) 90 days:

   • Estimated total index size: ~ 900 - 1800 GB for 90 days

   • Total number of primary/leader shards: 36-72

   • Total number of shards including 1 replica each:   72-144

   • Total number of co-located Solr nodes: ~18-36 nodes, up to 2 shards per node(does not include replicas)

   • Total number of dedicated Solr nodes: ~3-6 nodes, up to 12 shards per node (does not include replicas)

   Custom Time To Live (TTL) 30 days:

   • Estimated total index size: 300 - 600 GB for 30 days

   • Total number of primary/leader shards: 12-24

   • Total number of shards including 1 replica each: 24-48

   • Total number of co-located Solr nodes: ~6-12 nodes, up to 2 shards per node(does not include replicas)

   • Total number of dedicated Solr nodes: ~1-3 nodes, up to 12 shards per node(does not include replicas)

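If you choose to use at least 1 replica for availability, increase the number of nodes accordingly. If high availability is a requirement, consider using no fewer than 3 Solr nodes in any configuration.

As the examples show, a lower TTL requires fewer resources. If you need to retain data for the long term, you can use the Solr Data Manager to archive data to long-term storage (HDFS or S3) and provide Hive tables for easy querying of that data. With this strategy, hot data stays in Solr for fast access through the Ranger UI, and inactive data is archived to HDFS or S3 with access provided through Ranger.

9.3.2.4 Adding New Shards

If, after reviewing the recommendations above, you need to add shards to an existing deployment, refer to the following Solr documentation to understand how to do so:

   https://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-5.5.pdf

9.3.2.5 Out of Memory Exceptions

When using Ambari Infra together with Ranger Audit, if you see many Solr instances exiting with Java "Out Of Memory" exceptions, one solution is to upgrade the Ranger Audit schema to use less heap memory by enabling DocValues. This change requires re-indexing the data and is disruptive, but it helps considerably with memory consumption. See the following article:

   https://community.hortonworks.com/articles/156933/restore-backup-ranger-audits-to-newly-collection.html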
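
The index size figures in the recommendations follow from simple arithmetic: daily volume multiplied by TTL. The following is a rough sketch with a hypothetical function name. The default of 100 bytes per record is an assumption inferred from the "50 to 100 million records ~ 5 - 10 GB data per day" figures above; measure your own average record size before relying on it.

```python
def estimated_index_size_gb(records_per_day, ttl_days, bytes_per_record=100):
    """Estimate the total index size in (decimal) GB.

    bytes_per_record=100 is inferred from the figures above
    (50 million records per day is about 5 GB per day).
    """
    return records_per_day * bytes_per_record * ttl_days / 1e9


if __name__ == "__main__":
    for ttl_days in (90, 30):
        low = estimated_index_size_gb(50_000_000, ttl_days)
        high = estimated_index_size_gb(100_000_000, ttl_days)
        print(f"TTL {ttl_days} days: about {low:.0f} - {high:.0f} GB")
```

For 50 to 100 million records per day this reproduces the 450 - 900 GB (90 days) and 150 - 300 GB (30 days) ranges listed above, which makes the effect of lowering the TTL easy to see.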

References: Ambari Operations guide (Ambari 操作指南)

https://blog.csdn.net/devalone/article/details/80781652

https://blog.csdn.net/devalone/article/details/80800262

https://blog.csdn.net/devalone/article/details/80813176

https://blog.csdn.net/devalone/article/details/80826036

https://blog.csdn.net/devalone/article/details/80839371

https://blog.csdn.net/devalone/article/details/80854431
