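502 errors caused by a NullPointerException plus nginx configuration

Problem description:
Different endpoints of the service intermittently returned 502. The errors were spread across different endpoints and different nginx instances, which was very strange.

Competition service, production log platform:
error.log in nginx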
2020/12/23 16:59:59 [error] 22636#0: *380224130 no live upstreams while connecting to upstream, client: 100.117.86.88, server: aa.code.com, request: "GET /api/competition/process/student/detail?competitionId=127 HTTP/1.1", upstream: "http://aa_xes/api/competition/process/student/detail?competitionId=127", host: "aa.code.com", referrer: "https://aa.code.com/?id=c6a4761c4b3974d6fe56d77c8ebe3a0a&code=7597a7b2e4d0c50ac96db0cefca6f30448bda50049bf3307ab4a5e4e030afb88796788528826034928717be5efd3da6e"
2020/12/23 16:59:59 [error] 22636#0: *380224133 no live upstreams while connecting to upstream, client: 100.117.86.51, server: aa.code.com, request: "GET /api/competition/user/public/getCompPage?id=c6a4761c4b3974d6fe56d77c8ebe3a0a HTTP/1.1", upstream: "http://aa_xes/api/competition/user/public/getCompPage?id=c6a4761c4b3974d6fe56d77c8ebe3a0a", host: "aa.code.com", referrer: "https://aa.code.com/?id=c6a4761c4b3974d6fe56d77c8ebe3a0a&code=443082759c2302f47aa47b7c0a92bed72bc1f473c4da309c7e5650e7d3e662e1"

Error entry in access.log
{"@timestamp":"2020-12-23T17:56:29+08:00","cookie_id":"-","client_ip":"100.117.86.102","remote_user":"-","request_method":"GET","domain":"aa.code.com","user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36","xff":"218.79.230.248","upstream_addr":"172.16.1.52:8380","upstream_response_time":"0.017","request_time":"0.017","size":"113","idc_tag":"tjtx","status":"500","upstream_status":"500","host":"bcy-nginx01","via":"-","protocol":"http","request_uri":"/api/competition/process/student/detail?competitionId=127","http_referer":"https://aa.code.com/?id=c6a4761c4b3974d6fe56d77c8ebe3a0a&code=3e2418e23c5b47051dfd61ce13b6f5916c7ff04509bfb837ebb7d08b2d85912f"}

Error log in the service
2020-12-23 18:12:50.355[ERROR][  XNIO-1 task-4]       c.t.c.w.e.GlobalExceptionHandler        : [Handle_Exception] - java.lang.NullPointerException
        at com.tal.competition.service.CompetitionProcessItemService.lambda$findStudentProcesses$2(CompetitionProcessItemService.java:385)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1507)
        at com.tal.competition.service.CompetitionProcessItemService.findStudentProcesses(CompetitionProcessItemService.java:332)

First, understand the upstream:
upstream jingsai_xes {
      server 172.16.1.53:8380;
      server 172.16.1.51:8380;
      server 172.16.1.48:8380;
      server 172.16.1.49:8380;
      server 172.16.1.47:8380;
      server 172.16.1.52:8380;
      server 172.16.1.44:8380;
      server 172.16.1.50:8380;
      server 172.16.1.45:8380;
      server 172.16.1.46:8380;
      check interval=2000 rise=2 fall=1 timeout=1000 type=tcp port=8380;
 }
 These are the configured upstream servers.

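Meaning of proxy_next_upstream: if an idempotent request fails on one machine, nginx passes it to the next server in the upstream.
Understanding the max_fails parameter: max_fails defaults to 1 and fail_timeout defaults to 10 seconds. In other words, by default a single failed attempt within 10 seconds is enough for nginx to consider the backend server faulty and mark it unavailable; after 10 seconds nginx sends requests to that server again. Which responses count as failed attempts is decided by proxy_next_upstream: error, timeout and invalid_header always count, while http_500 and similar statuses count only when they are listed in the directive.
Reference: http://blog.chinaunix.net/uid-29580597-id-4415903.html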

check interval=2000 rise=2 fall=1 timeout=1000 type=tcp port=8098;
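# Every node in the load-balancing pool is checked every 2 seconds. After 2 successful checks the realserver is marked up; after 1 failed check (fall=1) it is marked down. The health check times out after 1 s, and the check type is a TCP connection (type=tcp).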
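
The rise/fall counting described above can be sketched as a small state machine. This is only an illustration of the parameters, not the health-check module's code; the function name apply_checks and the initial state passed as `up` are assumptions made for the example.

```python
def apply_checks(results, rise=2, fall=1, up=True):
    """Replay a sequence of health-check results and return the server state after each one.

    results: iterable of booleans, True for a successful check.
    rise:    consecutive successes needed to mark a down server up.
    fall:    consecutive failures needed to mark an up server down.
    up:      initial state (an assumption of this sketch).
    """
    ok_run = 0
    fail_run = 0
    states = []
    for ok in results:
        if ok:
            ok_run += 1
            fail_run = 0
            if not up and ok_run >= rise:
                up = True
        else:
            fail_run += 1
            ok_run = 0
            if up and fail_run >= fall:
                up = False
        states.append(up)
    return states


# One failed check takes the server down; two successful checks bring it back.
print(apply_checks([True, False, True, True]))
```

With rise=2 and fall=1 the example prints [True, False, False, True]: a single failed check is enough to mark the server down, and it takes two consecutive successes to recover.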

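After adding this option, the directive becomes: proxy_next_upstream error timeout http_500 non_idempotent; and the problem was finally solved.
In other words, requests with methods that are not idempotent on the server, such as POST, LOCK and PATCH, are not retried by default; if they must be retried, this option has to be added.
Reference: https://zhuanlan.zhihu.com/p/35803906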

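Scenario in which a single nginx instance considers all backend servers failed:
The trigger is this nginx machine receiving calls to /api/competition/process/student/detail?competitionId=127 within a 10-second window.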

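Root cause:
When the request for competitionId=127 enters nginx and is forwarded to a backend server, that server returns 500 (the NullPointerException), so nginx retries the request on every server in turn. If the request for 127 arrives twice within 10 seconds, nginx considers all servers unavailable and reports "no live upstreams while connecting to upstream".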
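
The chain of events above can be replayed with a toy model. It is a deliberately simplified sketch of the explanation in this post, not nginx's real peer-selection code; the class names Peer and Upstream and the method handle are made up for the example, and max_fails is assumed to be its default of 1 with fail_timeout at 10 seconds.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    addr: str
    down_until: float = 0.0  # peer is treated as unavailable until this time


@dataclass
class Upstream:
    peers: list
    fail_timeout: float = 10.0  # assumed default; max_fails is fixed at 1 in this model

    def handle(self, now: float, backend_status: int) -> int:
        """Return the status the client sees for a request arriving at `now`.

        `backend_status` is what every backend answers for this request
        (500 for the request that hits the NullPointerException).
        """
        live = [p for p in self.peers if p.down_until <= now]
        if not live:
            # Modeled "no live upstreams while connecting to upstream".
            return 502
        for peer in live:
            if backend_status != 500:
                return backend_status
            # http_500 is listed in proxy_next_upstream, so a 500 counts as a
            # failed attempt: mark the peer down and try the next one.
            peer.down_until = now + self.fail_timeout
        return backend_status  # every peer answered 500


upstream = Upstream([Peer("172.16.1.44:8380"), Peer("172.16.1.45:8380")])
print(upstream.handle(0.0, 500))   # failing request, retried on every peer
print(upstream.handle(1.0, 200))   # healthy request one second later
print(upstream.handle(11.0, 200))  # healthy request after fail_timeout
```

In this model the three calls print 500, 502 and 200: one request that fails everywhere takes every peer out of rotation, so even a healthy endpoint gets 502 until fail_timeout has passed. That would explain why the 502s showed up on unrelated endpoints.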

Why does only the competition service have this problem?
The other services do not configure http_500 in proxy_next_upstream.

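error.log statistics for the minute 5:00
nginx 01
Only seconds 7-14 had no "no live upstreams while connecting to upstream" errors,
and the access log shows that requests came in during seconds 07-08.

nginx 02
Seconds 16-20 had no "no live upstreams while connecting to upstream" errors.

nginx 03
Seconds 02-06, 29-32, 41-45 and 54-60 had no "no live upstreams while connecting to upstream" errors.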
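
A per-second tally like the one above could be produced with a short script. The following is a rough sketch that assumes the error.log timestamp format shown earlier in this post; the function name quiet_seconds is hypothetical, and the sample line is an abbreviated copy of the first error.log entry.

```python
import re

# Timestamp at the start of an nginx error.log line, e.g. "2020/12/23 16:59:59 [error] ..."
LINE_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}):(\d{2}) \[error\]")
MESSAGE = "no live upstreams while connecting to upstream"


def quiet_seconds(lines, minute):
    """Return the seconds (0-59) of `minute` in which MESSAGE was not logged.

    minute: string such as "2020/12/23 16:59".
    """
    noisy = set()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(1) == minute and MESSAGE in line:
            noisy.add(int(match.group(2)))
    return [second for second in range(60) if second not in noisy]


# Abbreviated copy of the first error.log entry quoted in this post.
sample = [
    "2020/12/23 16:59:59 [error] 22636#0: *380224130 no live upstreams "
    "while connecting to upstream, client: 100.117.86.88, server: aa.code.com",
]
print(len(quiet_seconds(sample, "2020/12/23 16:59")))
```

For the single sample line this prints 59, since only second 59 of that minute contains the error. Run against a full error.log, the returned list gives the quiet seconds for each nginx instance.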


Configuration change
proxy_next_upstream_tries 3
The status code returned is 502.

nginx configuration before the change:
    server {
        listen        80;
        server_name   aa.code.com;
        proxy_next_upstream   http_500  http_502 http_503 http_504 error timeout invalid_header;
        proxy_set_header Accept-Encoding "";
        client_max_body_size 64M;
        access_log      /data/log/nginx/aa-access.log main_json;
        location / {
            proxy_read_timeout      3600;
            proxy_connect_timeout   300;
            proxy_set_header Host $http_host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto https;
            root /data/webroot/cp-website/;
        }

nginx configuration after the change (that is, http_500 was removed from proxy_next_upstream):
    server {
        listen        80;
        server_name   aa.code.com;
        proxy_next_upstream     http_502 http_503 http_504 error timeout invalid_header;
        proxy_set_header Accept-Encoding "";
        client_max_body_size 64M;
        access_log      /data/log/nginx/aa-access.log main_json;
        location / {
            proxy_read_timeout      3600;
            proxy_connect_timeout   300;
            proxy_set_header Host $http_host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto https;
            root /data/webroot/cp-website/;
        }

Blog posts on similar problems:
https://blog.****.net/piaohai/article/details/102753168 Troubleshooting Nginx 502 - proxy_next_upstream
https://blog.51cto.com/zhangshujian/1092800 Nginx 502 and 503 errors: trigger conditions and fixes


Summary of reasons nginx turns a 500 into a 502:
supervisorctl
nginx machine configuration
An internal backend service returned 500, which was turned into 502
