
Using EMS on HP-UX

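1. Introduction to EMS
EMS (Event Monitoring Service) is a service integrated into HP-UX. It monitors the host hardware in real time and reports events to system maintainers through a configurable notification method. This helps operations staff detect host faults promptly and accurately, helps locate the faulty component, and improves host availability.
EMS is managed through MRM (Monitoring Request Manager), which sets the scope of monitoring, the conditions that trigger an event alarm, and the way event notifications are delivered.
To start MRM: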

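(1) Log in to the host as root
(2) Run /etc/opt/resmon/lbin/monconfig
(3) Configure monitoring from the Monitoring Request Manager (MRM) Main Menu
From the MRM menu you can show, check, add, modify, delete, enable, and disable monitoring requests.
The menu looks like this: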
============================================================================
=================== Event Monitoring Service ===================
=================== Monitoring Request Manager ===================
============================================================================

EVENT MONITORING IS CURRENTLY ENABLED.
EMS Version : A.04.10
STM Version : C.46.15

============================================================================
============== Monitoring Request Manager Main Menu ==============
============================================================================

Note: Monitoring requests let you specify the events for monitors
to report and the notification methods to use.

Select:
(S)how monitoring requests configured via monconfig
(C)heck detailed monitoring status
(L)ist descriptions of available monitors
(A)dd a monitoring request
(D)elete a monitoring request
(M)odify an existing monitoring request
(E)nable Monitoring
(K)ill (disable) monitoring
(H)elp
(Q)uit
Enter selection: [s]

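The following example creates a custom monitoring request to show how MRM is configured:
(1) Log in to the system as root
(2) Run /etc/opt/resmon/lbin/monconfig to enter the MRM main menu (the one shown above)
(3) Type a and press Enter, which selects (A)dd a monitoring request
(4) The hardware modules available for monitoring are listed; normally select all of them by typing a and pressing Enter
(5) Select the baseline event severity; 2) MINOR WARNING is recommended
(6) Select the condition that triggers the alarm; choose 4) >=
(7) Select the notification method for event information; choose 6) EMAIL
(8) Enter the recipient of the event mail; any suitable user name can be entered here, for example: monitor
(9) Add a comment describing this monitoring request; choose (A)dd
(10) At Client Configuration File, choose (C)lear
(11) Save the configuration; this returns you to the main menu
(12) In the main menu, choose (S)how monitoring requests configured via monconfig to confirm that the new monitoring request exists
(13) Back in the MRM main menu, choose (C)heck detailed monitoring status to see every monitor that is actually active. The result depends on the host configuration: EMS ignores hardware that is not present in the host, even if all hardware was selected in step 4
(14) Choose (E)nable Monitoring to turn the EMS service on
Note: the monitoring request created above watches all hardware modules (step 4) in real time, but only events whose severity is greater than or equal to MINOR WARNING (steps 5 and 6) are reported, by email (step 7), to the user monitor (step 8).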
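
The trigger rule set up in steps 5 and 6 can be illustrated with a small Python sketch. This is not EMS code, only a model of the comparison. The numeric values 2 (MINOR WARNING) and 5 (CRITICAL) appear in this article; the other three levels are filled in from memory of HP's EMS documentation and should be treated as an assumption. The function name should_notify is hypothetical.

```python
import operator

# Severity levels. 2 and 5 are confirmed by this article;
# 1, 3 and 4 are assumed from HP's EMS documentation.
SEVERITY = {
    "INFORMATION": 1,
    "MINOR WARNING": 2,
    "MAJOR WARNING": 3,
    "SERIOUS": 4,
    "CRITICAL": 5,
}

# Only ">=" is used in the walkthrough; the other operators are
# included for illustration and may not match the MRM menu exactly.
OPERATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
}


def should_notify(event_severity, threshold="MINOR WARNING", op=">="):
    """Return True if an event of this severity would trigger a notification."""
    compare = OPERATORS[op]
    return compare(SEVERITY[event_severity], SEVERITY[threshold])


if __name__ == "__main__":
    for name in SEVERITY:
        print(name, should_notify(name))
```

With the settings from the walkthrough, an INFORMATION event is filtered out, while MINOR WARNING and anything more severe is mailed to the recipient.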

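2. Getting information from event mail
Event alert mail generated by EMS can be received over the internal network without configuring a separate DNS server. EMS sends the mail to the predefined target user monitor, and it can be collected with a mail client on a PC (Outlook, for example).
Taking Outlook as an example: to receive event mail, create a new mail account in the client. The user name is the HP-UX user name specified in MRM, the password is that user's HP-UX password, and the POP3/SMTP server is the IP address of the monitored host. It is advisable to have Outlook check for new mail at a fixed interval so that EMS event information arrives promptly.
Notes:
(1) Because of HP-UX's own security mechanisms, root's mail cannot be collected with a client program, so specify an ordinary user as the event mail recipient in MRM; in this example a new user named monitor was created for this purpose
(2) The network must allow the POP3 and POP2 ports, 110 and 109
(3) The user that receives event mail is a real HP-UX user and can also log in to the host, so change that user's HP-UX password regularly and update the password stored in Outlook to match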
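
As an alternative to a desktop mail client, the same mailbox can be polled from a script. The following is a minimal sketch using Python's standard poplib module, assuming the monitored host really does offer POP3 on port 110 as described above. The function names, and the idea of filtering on the notification banner text, are illustrative choices and not part of EMS.

```python
import poplib

# Banner text that appears in the EMS notification mail shown in this article.
EMS_BANNER = "Event Monitoring Service Event Notification"


def fetch_messages(conn):
    """Return every message in the mailbox as a decoded string.

    conn is a logged-in poplib.POP3 object (or anything with the same
    stat() and retr() methods).
    """
    count, _mailbox_size = conn.stat()
    messages = []
    for index in range(1, count + 1):  # POP3 message numbers start at 1
        _response, lines, _octets = conn.retr(index)
        messages.append(b"\n".join(lines).decode("utf-8", errors="replace"))
    return messages


def fetch_ems_mail(host, user, password, port=110):
    """Log in over POP3 and return only the messages that look like EMS events."""
    conn = poplib.POP3(host, port, timeout=30)
    try:
        conn.user(user)
        conn.pass_(password)
        return [m for m in fetch_messages(conn) if EMS_BANNER in m]
    finally:
        conn.quit()
```

A call would look like fetch_ems_mail("15.85.114.14", "monitor", password), using the host address and user name from this article's example. Plain POP3 sends the password unencrypted, which is one more reason to keep this traffic on the internal network and to rotate the password as note (3) suggests.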

The following example shows the content of an event alert mail generated by EMS. The fault was produced deliberately by pulling out a hard disk while the system was powered on (text after "<--" is an annotation and is not part of the mail).

>------------ Event Monitoring Service Event Notification ------------<
Notification Time: Wed Jun 8 23:26:18 2005   <-- time the notification was triggered
hpux1 sent Event Monitor notification information:   <-- shows the host name
/storage/events/disks/default/0_0_1_1.15.0 is >= 2.   <-- hardware resource and trigger condition
Its current value is CRITICAL(5).   <-- severity of this event
User Comments:
Just a test:)
Event data from monitor:
Event Time..........: Wed Jun 8 23:26:16 2005
Severity............: CRITICAL
Monitor.............: disk_em
Event #.............: 101
System..............: hpux1
Summary:   <-- event summary
Disk at hardware path 0/0/1/1.15.0 : Device removed from monitoring

Description of Error:   <-- fault description
The device has been removed from the list of devices being monitored by
this monitor.
Probable Cause / Recommended Action:   <-- probable cause / recommended action
The device was removed from the system, has stopped responding to the
system or it has been replaced with a device that is not supported by this
monitor.
Run ioscan to determine the state and type of the device.
Check the /var/stm/data/os_decode_xref for the information indicating
which devices are supported by this monitor.
Check other monitors to determine if they are now monitoring the
device by running /etc/opt/resmon/lbin/monconfig and using the "Check
monitoring" command.
Additional Event Data:
System IP Address...: 15.85.114.14   <-- host IP address
Event Id............: 0x42a70e1800000000
Monitor Version.....: B.01.01
Event Class.........: I/O   <-- event class
Client Configuration File...........:
/var/stm/config/tools/monitor/default_disk_em.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 1
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800/A500-44   <-- host model
OS Version......................: B.11.11   <-- operating system version
STM Version.....................: A.45.00
EMS Version.....................: A.04.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/disk_em.htm#101

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v

Component Data:
Physical Device Path...: 0/0/1/1.15.0   <-- physical path of the failed device
Device Class...........: Disk   <-- device type
Inquiry Vendor ID......: SEAGATE   <-- device manufacturer
Inquiry Product ID.....: ST34572WC   <-- product ID
Firmware Version.......: HP03   <-- firmware version
Serial Number..........: JKJ118650QPJCX   <-- serial number of the failed part
>---------- End Event Monitoring Service Event Notification ----------<

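The event mail shows the time the fault occurred, the host name, the event severity, the physical path of the failed disk, the disk's product ID, the recommended checks, the host model, the operating system version, and so on, all of which help in detecting and troubleshooting host hardware faults.
However, a host hardware fault is not always a simple failure of a single component, so the Probable Cause / Recommended Action text in an event mail may differ from the fault that is finally confirmed; this is normal. Fault analysis often needs additional tools and methods.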
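
The key fields of a notification can also be extracted automatically, for example to feed a ticketing system. The sketch below is one possible approach in Python, assuming the "Name....: value" layout seen in the sample mail. The function name parse_event_mail and the default list of wanted fields are illustrative; other monitors may use different field names.

```python
import re

# Matches lines such as "Severity............: CRITICAL".
# The filler between name and colon may be ASCII dots or the "…"
# character that some web pages substitute for three dots.
FIELD = re.compile(r"^\s*([^.…:]+?)\s*[.…]*\s*:\s*(.*)$")

DEFAULT_FIELDS = (
    "Event Time",
    "Severity",
    "Monitor",
    "System",
    "Physical Device Path",
    "Inquiry Product ID",
    "System Model Number",
    "OS Version",
)


def parse_event_mail(text, wanted=DEFAULT_FIELDS):
    """Return a dict of the wanted fields found in an EMS notification."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue
        key, value = match.group(1), match.group(2).strip()
        # Keep only the first occurrence of each wanted field.
        if key in wanted and key not in fields:
            fields[key] = value
    return fields


if __name__ == "__main__":
    sample = (
        "Event Time..........: Wed Jun 8 23:26:16 2005\n"
        "Severity............: CRITICAL\n"
        "Monitor.............: disk_em\n"
        "Physical Device Path...: 0/0/1/1.15.0\n"
    )
    print(parse_event_mail(sample))
```

Parsing only gives the monitor's own view of the event, so the extracted Probable Cause text should be treated as a starting point for diagnosis, not as the final verdict.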
