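0x00 Preface
Our ES cluster has recently gone down too many times because of JVM crashes. To recover quickly and be notified as soon as it happens, I decided to set up an ops tool that restarts the process on failure and sends an alert. I first evaluated three programs: systemd, monit, and supervisor. systemd ships with CentOS 7, is very stable, and its configuration is familiar and easy to pick up, but it does not support alerting. monit is powerful: besides process monitoring it can monitor process resource usage, directories, files, and more, and it is minimally invasive because processes do not have to be started by monit. However, its alerting cannot run custom code. supervisor is a process-monitoring framework written in Python. It can only monitor foreground processes, so it cannot be used for a process that puts itself in the background, but its alerting goes through events and listeners, which makes custom monitoring code possible.
Because the alerts go out through a DingTalk robot and ES can run in the foreground, I chose supervisor.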
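0x01 Installing supervisor
If EPEL is enabled on CentOS 7, Python can be installed with yum. If it is not enabled, enable it with the following command.
    yum install -y epel-release
Install Python 3.6. The newest Python that EPEL provides is 3.6; a later version can also be compiled locally.
    yum install -y python36
Once Python is installed, install supervisor with pip.
    python3 -m pip install supervisor -i https://mirrors.aliyun.com/pypi/simple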
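0x02 Managing supervisor with systemd
Because each node in the ES cluster runs a 32 GB JVM, the four Linux system settings below are required. Of these, the only one that supervisor inherits by default is vm.max_map_count, which is a kernel-wide setting. nofile and memlock are not read from the system configuration, because /etc/security/limits.conf is applied to login sessions through PAM and not to services started by systemd. They therefore have to be set when systemd starts supervisor.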
    echo 'es soft nofile 655350' >> /etc/security/limits.conf && \
    echo 'es hard nofile 655350' >> /etc/security/limits.conf && \
    echo 'es - memlock unlimited' >> /etc/security/limits.conf && \
    echo 'vm.max_map_count=2621440' >> /etc/sysctl.conf && \
    sysctl -p
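To confirm that the limits took effect, the check can be scripted. The following is a minimal sketch, assuming it is run from a fresh login session of the es user; the helper names `limit_ok` and `check_limits` are made up for illustration, and the threshold is the 655350 used above.

```python
import resource


def limit_ok(value, required):
    """Return True if an rlimit value satisfies the required minimum."""
    # RLIM_INFINITY means "unlimited", which satisfies any requirement.
    return value == resource.RLIM_INFINITY or value >= required


def check_limits(required_nofile=655350):
    """Compare this process's soft limits with the values configured above."""
    nofile_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    memlock_soft, _ = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return {
        "nofile": limit_ok(nofile_soft, required_nofile),
        # memlock was configured as "unlimited", so only that value passes.
        "memlock": memlock_soft == resource.RLIM_INFINITY,
    }


if __name__ == "__main__":
    for name, ok in check_limits().items():
        print(name, "ok" if ok else "too low")
```

The script only reports the limits of the process that runs it, so it says nothing about a supervisord that was started by systemd.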
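The settings and the supervisord unit are shown below. Create the file /usr/lib/systemd/system/supervisord.service. supervisor cannot be started yet at this point, because its own configuration file has not been written.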
    [Unit]
    Description=Supervisor daemon

    [Service]
    Type=forking
    ExecStart=/usr/local/bin/supervisord -c /etc/supervisord.conf
    ExecStop=/usr/local/bin/supervisorctl -c /etc/supervisord.conf $OPTIONS shutdown
    ExecReload=/usr/local/bin/supervisorctl -c /etc/supervisord.conf $OPTIONS reload
    KillMode=process
    Restart=on-failure
    RestartSec=42s
    LimitNOFILE=655350
    LimitMEMLOCK=infinity

    [Install]
    WantedBy=multi-user.target
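Once supervisord is running under this unit, one way to confirm that LimitNOFILE and LimitMEMLOCK were applied is to read the process's /proc/<pid>/limits file. The sketch below parses that file's text; `parse_proc_limits`, `read_limits`, and the `SAMPLE` text are illustrative, not part of supervisor or systemd.

```python
import re

# Shortened example of what /proc/<pid>/limits looks like on Linux.
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max open files            655350               655350               files
Max locked memory         unlimited            unlimited            bytes
"""


def parse_proc_limits(text):
    """Map each limit name to its (soft, hard) values as strings."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def read_limits(pid):
    """Read and parse the limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_proc_limits(fh.read())


if __name__ == "__main__":
    print(parse_proc_limits(SAMPLE)["Max open files"])
```

In practice the pid would come from the pidfile configured for supervisord, for example by calling `read_limits(int(open('/var/run/supervisord.pid').read()))`.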
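0x03 Configuring supervisor
Create the file /etc/supervisord.conf. The supervisor configuration file has roughly three parts:
The first part configures the supervisor service itself: mainly logging, the system limits applied when child processes start, the unix socket, and so on.
The second part configures the processes to be started: mainly the working directory, restart behavior, logging, environment variables, and so on.
The third part configures the event listener and uses the same kinds of settings as the second part. The complete configuration is as follows: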
    [unix_http_server]
    file=/var/run/supervisor.sock ; the path to the socket file

    [supervisord]
    logfile=/var/log/supervisord.log ; main log file; default $CWD/supervisord.log
    logfile_maxbytes=50MB ; max main logfile bytes b4 rotation; default 50MB
    logfile_backups=10 ; # of main logfile backups; 0 means none, default 10
    loglevel=info ; log level; default info; others: debug,warn,trace
    pidfile=/var/run/supervisord.pid ; supervisord pidfile; default supervisord.pid
    nodaemon=false ; start in foreground if true; default false
    silent=false ; no logs to stdout if true; default false
    minfds=655350 ; min. avail startup file descriptors; default 1024
    minprocs=4096 ; min. avail process descriptors;default 200

    [rpcinterface:supervisor]
    supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

    [supervisorctl]
    serverurl=unix:///var/run/supervisor.sock ; use a unix:// URL for a unix socket

    [program:es_node_a]
    command=/data1/es7/elasticsearch-7.4.0/bin/elasticsearch ; the program (relative uses PATH, can take args)
    process_name=%(program_name)s ; process_name expr (default %(program_name)s)
    numprocs=1 ; number of processes copies to start (def 1)
    directory=/data1/es7/elasticsearch-7.4.0/bin ; directory to cwd to before exec (def no cwd)
    priority=10 ; the relative start priority (default 999)
    autostart=true ; start at supervisord start (default: true)
    startsecs=120 ; # of secs prog must stay up to be running (def. 1)
    startretries=3 ; max # of serial start failures when starting (default 3)
    autorestart=true ; when to restart if exited after running (def: unexpected)
    exitcodes=0 ; 'expected' exit codes used with autorestart (default 0)
    stopsignal=TERM ; signal used to kill process (default TERM)
    stopwaitsecs=60 ; max num secs to wait b4 SIGKILL (default 10)
    user=es ; setuid to this UNIX account to run the program
    redirect_stderr=false ; redirect proc stderr to stdout (default false)
    stdout_logfile=NONE
    stderr_logfile=/data1/es_supervisor_error.log ; stderr log path, NONE for none; default AUTO
    environment=JAVA_HOME=""

    [program:es_node_b]
    command=/data1/es7b/elasticsearch-7.4.0/bin/elasticsearch ; the program (relative uses PATH, can take args)
    process_name=%(program_name)s ; process_name expr (default %(program_name)s)
    numprocs=1 ; number of processes copies to start (def 1)
    directory=/data1/es7b/elasticsearch-7.4.0/bin ; directory to cwd to before exec (def no cwd)
    priority=10 ; the relative start priority (default 999)
    autostart=true ; start at supervisord start (default: true)
    startsecs=120 ; # of secs prog must stay up to be running (def. 1)
    startretries=3 ; max # of serial start failures when starting (default 3)
    autorestart=true ; when to restart if exited after running (def: unexpected)
    exitcodes=0 ; 'expected' exit codes used with autorestart (default 0)
    stopsignal=TERM ; signal used to kill process (default TERM)
    stopwaitsecs=60 ; max num secs to wait b4 SIGKILL (default 10)
    user=es ; setuid to this UNIX account to run the program
    redirect_stderr=false ; redirect proc stderr to stdout (default false)
    stdout_logfile=NONE
    stderr_logfile=/data1/es_supervisor_error.log ; stderr log path, NONE for none; default AUTO
    environment=JAVA_HOME=""

    [eventlistener:es_event_listener]
    command=/data1/es_event_monitor.py ; the program (relative uses PATH, can take args)
    process_name=%(program_name)s ; process_name expr (default %(program_name)s)
    numprocs=1 ; number of processes copies to start (def 1)
    events=PROCESS_STATE_FATAL ; event notif. types to subscribe to (req'd)
    buffer_size=10 ; event buffer queue size (default 10)
    directory=/data1 ; directory to cwd to before exec (def no cwd)
    priority=-1 ; the relative start priority (default -1)
    autostart=true ; start at supervisord start (default: true)
    startsecs=10 ; # of secs prog must stay up to be running (def. 1)
    startretries=3 ; max # of serial start failures when starting (default 3)
    autorestart=unexpected ; autorestart if exited after running (def: unexpected)
    exitcodes=0 ; 'expected' exit codes used with autorestart (default 0)
    stopsignal=TERM ; signal used to kill process (default TERM)
    stopwaitsecs=10 ; max num secs to wait b4 SIGKILL (default 10)
    user=root ; setuid to this UNIX account to run the program
    redirect_stderr=false ; redirect_stderr=true is not allowed for eventlisteners
    stdout_logfile=event.log ; stdout log path, NONE for none; default AUTO
    stderr_logfile=event.log
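Two of the [program:x] settings in this file are easy to get wrong for ES: startsecs has to cover the startup time, and stopsignal has to be TERM. As a rough sketch, the standard-library configparser can read a supervisor-style file and flag both. The function `lint_programs`, the 20-second threshold, and the `SAMPLE` text (including the `demo` program) are assumptions made for this illustration; supervisor's own parser may treat edge cases differently.

```python
import configparser

SAMPLE = """\
[program:es_node_a]
command=/data1/es7/elasticsearch-7.4.0/bin/elasticsearch ; the program
startsecs=120 ; secs the program must stay up
stopsignal=TERM ; signal used to stop the process

[program:demo]
command=/bin/true
stopsignal=QUIT
"""


def lint_programs(text, min_startsecs=20):
    """Return warnings for [program:x] sections with risky settings."""
    parser = configparser.ConfigParser(
        interpolation=None,              # keep %(program_name)s literal
        inline_comment_prefixes=(";",),  # supervisor-style trailing comments
    )
    parser.read_string(text)
    warnings = []
    for section in parser.sections():
        if not section.startswith("program:"):
            continue
        opts = parser[section]
        startsecs = int(opts.get("startsecs", "1"))  # documented default is 1
        if startsecs < min_startsecs:
            warnings.append(f"{section}: startsecs={startsecs} is below {min_startsecs}")
        stopsignal = opts.get("stopsignal", "TERM")  # documented default is TERM
        if stopsignal != "TERM":
            warnings.append(f"{section}: stopsignal is {stopsignal}, not TERM")
    return warnings


if __name__ == "__main__":
    for warning in lint_programs(SAMPLE):
        print(warning)
```

For the real file, the text would be read from /etc/supervisord.conf and passed to `lint_programs` in the same way.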
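A few points deserve attention:
First, minfds and minprocs. Earlier, systemd set the file-descriptor and locked-memory limits for the supervisor process itself. Here, the minimum number of file descriptors and the minimum number of processes must be configured for the processes that supervisor starts.
Second, startsecs=120. This setting means that after supervisor launches ES, it keeps ES in the STARTING state for 120 seconds, after which ES enters the RUNNING state. The default is 1 second, while ES generally needs more than 20 seconds to start. With the default or any value that is too small, and with autorestart=true, a startup error makes supervisor restart ES over and over, startretries=3 is never applied, and no alert is triggered. The reason is that when startsecs is too short, the process reaches RUNNING before it fails, so it keeps cycling between STARTING and RUNNING. startretries=3 only counts failures that happen while the process is still STARTING, which put it into the BACKOFF state. Only after those retries are exhausted does the process become FATAL, which is the event the listener subscribes to.
Third, stopsignal=TERM. ES exits cleanly when it receives SIGTERM, in the same way as when it is interrupted with Ctrl+C (which sends SIGINT). The template printed by the echo_supervisord_conf command contains stopsignal=QUIT, which prevents ES from exiting normally, since a JVM typically answers SIGQUIT with a thread dump instead of shutting down.
Fourth, environment=JAVA_HOME="". In some environments JAVA_HOME is set, which stops ES from using its bundled JDK. supervisor cannot remove an environment variable, so the workaround is to set the variable to an empty value.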
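The interaction between startsecs, startretries, and autorestart described in the second point can be illustrated with a small simulation. This is a deliberately simplified model and not supervisor's actual code: the function `simulate` is hypothetical, it assumes the process always dies a fixed number of seconds after launch, and it ignores the growing delay between retries.

```python
def simulate(crash_after, startsecs, startretries=3, max_launches=10):
    """Return the state sequence for a process that dies `crash_after` seconds after every launch."""
    states = []
    backoff = 0  # consecutive failed starts
    for _ in range(max_launches):
        states.append("STARTING")
        if crash_after < startsecs:
            # Died before startsecs elapsed: counted as a failed start.
            backoff += 1
            states.append("BACKOFF")
            if backoff > startretries:
                states.append("FATAL")  # this is what fires PROCESS_STATE_FATAL
                return states
        else:
            # Survived startsecs: treated as RUNNING and the counter resets.
            backoff = 0
            states.extend(["RUNNING", "EXITED"])
            # autorestart=true launches it again without counting a retry.
    return states


if __name__ == "__main__":
    # ES dies 30 s into startup in both runs; only startsecs differs.
    print(simulate(crash_after=30, startsecs=120)[-1])
    print("FATAL" in simulate(crash_after=30, startsecs=1))
```

In this model, a crash 30 seconds into startup ends in FATAL when startsecs=120, while with startsecs=1 the process keeps passing through RUNNING and never becomes FATAL, so no alert would be sent.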
0x04 Writing the event notification
An event listener can be in one of three states:
Name | Description |
---|---|
ACKNOWLEDGED | The event listener has acknowledged (accepted or rejected) an event send. |
READY | Event notifications may be sent to this event listener |
BUSY | Event notifications may not be sent to this event listener. |
When an event listener process first starts, supervisor automatically places it into the ACKNOWLEDGED state to allow for startup activities or guard against startup failures (hangs). Until the listener sends a READY\n string to its stdout, it will stay in this state.
When supervisor sends an event notification to a listener in the READY state, the listener will be placed into the BUSY state until it receives an OK or FAIL response from the listener, at which time, the listener will be transitioned back into the ACKNOWLEDGED state. [1]
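Put simply, after the event listener starts it is first in the ACKNOWLEDGED state. Once it has written the READY message to stdout, it blocks in readline waiting for the message that supervisor sends. When an event arrives, the listener is in the BUSY state until it has finished handling that event.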
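Each notification consists of a header line of space-separated key:value tokens, whose len field gives the size of the payload that follows. For PROCESS_STATE events the payload uses the same token format. The sketch below parses one such message from an in-memory stream; the helper names and the sample serial numbers are illustrative.

```python
import io


def parse_tokens(text):
    """Turn 'key:value key:value' into a dict."""
    return dict(token.split(":", 1) for token in text.split())


def read_event(stream):
    """Read one header line plus its payload from a text stream."""
    header = parse_tokens(stream.readline())
    payload = parse_tokens(stream.read(int(header["len"])))
    return header, payload


# A hand-written example of what supervisor sends when a program goes FATAL.
payload = "processname:es_node_a groupname:es_node_a from_state:BACKOFF"
header = (
    "ver:3.0 server:supervisor serial:21 pool:es_event_listener "
    f"poolserial:10 eventname:PROCESS_STATE_FATAL len:{len(payload)}\n"
)
head, body = read_event(io.StringIO(header + payload))
print(head["eventname"], body["processname"])
```

Splitting on the first colon only keeps any value that itself contains a colon intact.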
    #!/usr/bin/env python3
    import sys
    import json
    import socket
    import time
    import requests
    from loguru import logger

    _local_ip = None
    # Debug webhook
    # _dd_url = 'x'

    # Production webhook
    _dd_url = 'x'


    def get_host_ip():
        """
        Return this server's IP address, caching it after the first lookup.

        This is the neatest way I have seen to get the local IP: no extra
        dependencies and no guessing about network devices. Connecting a UDP
        socket sends no packets (nothing shows up in a packet capture); it only
        makes the kernel pick the outbound interface, whose address
        getsockname() then returns. Each lookup still allocates a UDP port,
        so the result is cached in _local_ip to avoid repeating it.
        :return: the local IP address as a string
        """
        global _local_ip
        s = None
        try:
            if not _local_ip:
                s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
                s.connect(('223.5.5.5', 80))
                _local_ip = s.getsockname()[0]
            return _local_ip
        finally:
            if s:
                s.close()


    def _dd_send(data):
        """Post a text message to the DingTalk robot webhook."""
        message = json.dumps({
            "msgtype": "text",
            "at": {
                # xxx is a placeholder left in the original; replace it with
                # the phone numbers to @-mention before running.
                "atMobiles": [xxx],
                "atUserIds": [],
                "isAtAll": False
            },
            "text": {"content": "{}".format(data)}
        })
        headers = {
            'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405',
            'Content-Type': 'application/json'
        }
        try:
            # The timeout keeps a hung request from blocking the listener.
            requests.post(_dd_url, headers=headers, data=message, timeout=10)
            # logger.info('Sends Successfully with {}'.format(data))
            return True
        except Exception as e:
            logger.error('Sends Failed with {}'.format(e))
            return False


    def write_stdout(s):
        # only eventlistener protocol messages may be sent to stdout
        sys.stdout.write(s)
        sys.stdout.flush()


    def write_stderr(s):
        sys.stderr.write(s)
        sys.stderr.flush()


    def main():
        get_host_ip()
        logger.remove()
        log_format = '<green>{time:YYYY-MM-DD HH:mm:ss.SSS}</green> | <level>{level}</level> | <cyan>{file}</cyan>:<cyan>{line}</cyan> - <level>{message}</level>'
        logger.add('/data1/es_event_monitor.log', format=log_format,
                   rotation='50 MB', colorize=True)
        while True:
            # transition from ACKNOWLEDGED to READY
            write_stdout('READY\n')

            # read the header line (blocks until supervisor sends an event)
            line = sys.stdin.readline()

            # parse the header and read the event payload it announces
            headers = dict([x.split(':') for x in line.split()])
            notify_data = sys.stdin.read(int(headers['len']))
            if 'es_node' in notify_data:
                now = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
                message = f'\nES node down alert\nTime: {now}\nNode: {_local_ip}\nEvent: {notify_data}'
                logger.info(message)
                _dd_send(message)

            # report the result; supervisor moves the listener back to ACKNOWLEDGED
            write_stdout('RESULT 2\nOK')


    if __name__ == '__main__':
        main()
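The listener can be exercised without supervisor by feeding it a hand-built event through in-memory streams. The sketch below is not the script above but a reduced version of one loop iteration with the streams and the notifier passed in; `handle_one_event` is a name made up here. Answering FAIL when the notifier raises is an assumption of this sketch, whereas the script above always answers OK.

```python
import io


def handle_one_event(stdin, stdout, notify):
    """Run one READY -> event -> RESULT cycle of the listener protocol."""
    stdout.write("READY\n")
    stdout.flush()
    header = dict(tok.split(":", 1) for tok in stdin.readline().split())
    payload = stdin.read(int(header["len"]))
    try:
        notify(payload)
        result = "OK"
    except Exception:
        result = "FAIL"  # assumption: report failures instead of hiding them
    # The RESULT line carries the length of the body that follows it.
    stdout.write(f"RESULT {len(result)}\n{result}")
    stdout.flush()


if __name__ == "__main__":
    event = "processname:es_node_a groupname:es_node_a from_state:BACKOFF"
    fake_stdin = io.StringIO(f"eventname:PROCESS_STATE_FATAL len:{len(event)}\n{event}")
    fake_stdout = io.StringIO()
    handle_one_event(fake_stdin, fake_stdout, print)
    print(repr(fake_stdout.getvalue()))
```

Passing the notifier in as a function lets the DingTalk call be replaced by `print` or a list's `append` during a dry run.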
0x05 References