Docker Swarm overlay networking: services cannot see the original client IP

The Swarm overlay network problem

Typical problem

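When an external request enters the ingress network (the Swarm load balancer), its source address is rewritten by SNAT to an address in the internal ingress range, such as 10.255.0.x, and only then is the request forwarded to the service container.

As a result, the remote address seen inside the service container is always a virtual IP from the ingress-internal range.

Any feature that depends on the real external IP, such as allowlists or access logs that record client addresses, stops working.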
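
One quick way to confirm the symptom is to classify the client addresses that show up in a service's access log. The sketch below is only an illustration: `ip_to_int`, `in_cidr`, and `classify_remote_ip` are hypothetical helper names, and 10.255.0.0/16 is assumed to be the ingress subnet (the value reported on this article's cluster; yours may differ).

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1          # split the address on dots into $1..$4
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Return 0 if IPv4 address $1 lies inside CIDR block $2 (e.g. 10.255.0.0/16).
in_cidr() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  # 4294967295 is 0xffffffff; keep only the top $bits bits.
  mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Assumed default; replace with the subnet your own cluster reports.
INGRESS_SUBNET=${INGRESS_SUBNET:-10.255.0.0/16}

# "ingress" means the real client address was lost to SNAT.
classify_remote_ip() {
  if in_cidr "$1" "$INGRESS_SUBNET"; then
    echo "ingress"
  else
    echo "external"
  fi
}

classify_remote_ip 10.255.0.3
classify_remote_ip 203.0.113.7
```

If every remote address in the log classifies as ingress, the service is affected by the SNAT behaviour described above.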

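Solution

There is no official fix (the script below is a workaround for moby/moby issue 25526), and Swarm now receives little new development. The only built-in way around the problem is to use host network mode and add an external load balancer to stand in for Swarm's built-in service routing mesh.

This, however, forces changes to both the application architecture and the deployment architecture: the multi-replica mode can no longer be used, so system availability suffers.

Most applications deployed on Swarm clusters end up working around the problem with host mode plus node labels that pin services to fixed nodes, which is a compromise.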
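
For reference, the host-mode workaround usually looks something like the following. This is a sketch only: the node name `node-1`, the service name `web`, the image `nginx:alpine`, port 80, and the label `edge=true` are all placeholders, and the function merely prints the commands so they can be reviewed before anything is run.

```shell
# Print (not run) the commands for the host-mode workaround.
# Every name used here is a placeholder chosen for illustration.
host_mode_commands() {
  local node=$1 service=$2 image=$3 port=$4

  # Label the node that will receive external traffic ...
  echo "docker node update --label-add edge=true $node"

  # ... then pin the service to labelled nodes and publish the port
  # directly on the host, bypassing the ingress routing mesh and its SNAT.
  echo "docker service create --name $service" \
       "--constraint node.labels.edge==true" \
       "--publish mode=host,published=$port,target=$port" \
       "$image"
}

host_mode_commands node-1 web nginx:alpine 80
```

Because the port is bound on the host itself, the external load balancer has to be pointed at the labelled nodes by hand, which is the architectural change the text above describes.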

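Is there a lightweight solution that recovers the real client IP without changing the application architecture? There is.

In recent years the community has proposed docker-ingress-routing-daemon, a lightweight enhancement implemented as a background script that must run on every Swarm node.

The script takes an --ingress-gateway-ips parameter. Running the script with no arguments prints the current node's ingress-internal IP, usually 10.255.0.x; repeat this on each Swarm node to collect the full list. Sample output: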

Usage: /home/swarm/docker-ingress-routing-daemon.sh [--install [OPTIONS] | --uninstall | --help]
           --services <services>  - service names to disable masquerading for
           --tcp-ports <ports>    - TCP ports to disable masquerading for
           --udp-ports <ports>    - UDP ports to disable masquerading for
   --ingress-gateway-ips <ips>    - specify load-balance ingress IPs
              --no-performance    - disable performance optimisations
    (services, ports and IPs may be comma or space-separated or may be specified
     multiple times)
!!! Ingress subnet: 10.255.0.0/16
!!! This node's ingress network IP: 10.255.0.2
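
One way to assemble the --ingress-gateway-ips value is to capture this output on every node and pull the address out of the "This node's ingress network IP" line. The sketch below covers only the extraction and joining; `extract_ingress_ip` and `join_ips` are hypothetical helper names, and how the output is gathered from each node (ssh, a configuration tool) is left open. Note that the script prints its usage text on stderr, so a real capture needs `2>&1`.

```shell
# Extract the ingress IP from the script's usage output (read on stdin).
extract_ingress_ip() {
  sed -n "s/^!!! This node's ingress network IP: //p"
}

# Join the arguments with commas: a b c -> a,b,c
join_ips() {
  local IFS=,
  echo "$*"
}

# Sample output copied from the listing above; on a real node, capture it
# with something like: /home/swarm/docker-ingress-routing-daemon.sh 2>&1
sample="!!! Ingress subnet: 10.255.0.0/16
!!! This node's ingress network IP: 10.255.0.2"

node1_ip=$(echo "$sample" | extract_ingress_ip)

# The other two addresses stand in for values collected from other nodes.
echo "--ingress-gateway-ips $(join_ips "$node1_ip" 10.255.0.3 10.255.0.4)"
```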

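Notes:

1. The script must be running before the application is deployed; starting it on a cluster whose services are already running has no effect on those services.

2. If the daemon is killed after the services are running, they keep working (the routing rules already applied remain in effect), but services deployed or updated afterwards will not work.

3. Operations should have a mechanism that keeps the daemon running at all times; if it is killed by mistake, it must be restarted before the next deployment or update.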
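
A simple way to meet the third requirement is a watchdog that is run periodically, for example from cron. The sketch below is one possible approach, not something the script provides: the pidfile, the log file path, and the function names are all assumptions, and the wrapper records the PID itself because the daemon is not known to write one.

```shell
# Assumed locations and values; adjust to your installation.
DAEMON=${DAEMON:-/home/swarm/docker-ingress-routing-daemon.sh}
PIDFILE=${PIDFILE:-/var/run/docker-ingress-routing-daemon.pid}
LOGFILE=${LOGFILE:-/var/log/docker-ingress-routing-daemon.log}
GATEWAY_IPS=${GATEWAY_IPS:-10.255.0.2,10.255.0.3,10.255.0.4}

# Return 0 if the PID recorded in file $1 belongs to a live process.
daemon_alive() {
  local pid
  [ -r "$1" ] || return 1
  pid=$(cat "$1")
  [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Start the daemon in the background and record its PID.
start_daemon() {
  nohup "$DAEMON" --ingress-gateway-ips "$GATEWAY_IPS" --install \
    >>"$LOGFILE" 2>&1 &
  echo $! > "$PIDFILE"
}

# Relaunch only when the recorded process is gone.
ensure_daemon() {
  daemon_alive "$PIDFILE" || start_daemon
}
```

Calling `ensure_daemon` every minute keeps the gap between an accidental kill and the restart short. A PID check alone can be fooled if the PID is reused by another process, so a supervisor such as systemd may be a sturdier choice where it is available.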

Reference (GitHub): http://github.com/newsnowlabs/docker-ingress-routing-daemon

Example command for launching the daemon on a node:

nohup /home/swarm/docker-ingress-routing-daemon.sh --ingress-gateway-ips 10.255.0.2,10.255.0.3,10.255.0.4 --install &

The full script follows:

#!/bin/bash
 
VERSION=3.3.0
 
# Ingress Routing Daemon v3.3.0
# Copyright © 2020-2021 Struan Bartlett
# ----------------------------------------------------------------------
# Permission is hereby granted, free of charge, to any person
# obtaining a copy of this software and associated documentation files
# (the "Software"), to deal in the Software without restriction,
# including without limitation the rights to use, copy, modify, merge,
# publish, distribute, sublicense, and/or sell copies of the Software,
# and to permit persons to whom the Software is furnished to do so,
# subject to the following conditions:
#
# The above copyright notice and this permission notice shall be
# included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
# BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
# ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
# ----------------------------------------------------------------------
# Workaround for https://github.com/moby/moby/issues/25526
 
log() {
  [ -z "$_BASHPID" ] && _BASHPID="$BASHPID"
  local D=$(date +%Y-%m-%d.%H:%M:%S.%N)
  local S=$(printf "%s|%s|%05d|" "${D:0:26}" "$HOSTNAME" "$_BASHPID")
  echo "$@" | sed "s/^/$S /g"
}    
 
quit() {
  trap '' EXIT TERM INT
  log "Docker Ingress Routing Daemon received signal, propagating it to pgroup $$ ..."
  kill -TERM -$$
  log "Docker Ingress Routing Daemon exiting."
  exit 0
}
 
detect_ingress() {
  read INGRESS_SUBNET INGRESS_DEFAULT_GATEWAY \
    < <(docker inspect ingress --format '{{(index .IPAM.Config 0).Subnet}} {{index (split (index .Containers "ingress-sbox").IPv4Address "/") 0}}' 2>/dev/null)
 
  [ -n "$INGRESS_SUBNET" ] && [ -n "$INGRESS_DEFAULT_GATEWAY" ] && nsenter --net=/var/run/docker/netns/ingress_sbox iptables -L >/dev/null && return 0
   
  return 1
}
 
usage() {
  echo "Usage: $0 [--install [OPTIONS] | --uninstall | --help]" >&2
  echo >&2
  echo "           --services <services>  - service names to disable masquerading for" >&2
  echo "           --tcp-ports <ports>    - TCP ports to disable masquerading for" >&2
  echo "           --udp-ports <ports>    - UDP ports to disable masquerading for" >&2
  echo "   --ingress-gateway-ips <ips>    - specify load-balance ingress IPs" >&2
  echo "              --no-performance    - disable performance optimisations" >&2
  echo >&2
  echo "    (services, ports and IPs may be comma or space-separated or may be specified" >&2
  echo "     multiple times)" >&2
  echo >&2
   
  if detect_ingress; then
    echo "!!! Ingress subnet: $INGRESS_SUBNET" >&2
    echo "!!! This node's ingress network IP: $INGRESS_DEFAULT_GATEWAY" >&2
  fi
   
  echo >&2
  exit 1
}
 
SCRIPT_PATH=$(dirname $(realpath $0))
DOCKER=$(which docker)
 
while true
do
  case "$1" in
        --services) shift; SERVICES+=($(echo "$1" | tr ',' ' ')); shift; continue; ;;
       --tcp-ports) shift; TCP_PORTS+=($(echo "$1" | tr ',' ' ')); shift; continue; ;;
       --udp-ports) shift; UDP_PORTS+=($(echo "$1" | tr ',' ' ')); shift; continue; ;;
         --install) shift; INSTALL=1; continue; ;;
       --uninstall) shift; INSTALL=0; continue; ;;
     --ingress-gateway-ips) shift; INGRESS_NODE_GATEWAY_IPS+=($(echo "$1" | tr ',' ' ')); shift; continue; ;;
  --no-performance) shift; PERFORMANCE=0; continue; ;;
    
      -h|--help) usage; ;;
             '') break; ;;
        
              *) usage; break; ;;
  esac
done
 
# Display usage, unless --install or --uninstall
[ -z "$INSTALL" ] && usage
 
# Convert arrays to comma-separated strings
TCPServicePortString=$(echo ${TCP_PORTS[@]} | tr ' ' ',')
UDPServicePortString=$(echo ${UDP_PORTS[@]} | tr ' ' ',')
 
if [ ${#INGRESS_NODE_GATEWAY_IPS[@]} -gt 0 ]; then
  INGRESS_NODE_GATEWAY_IPS=$(echo ${INGRESS_NODE_GATEWAY_IPS[@]})
fi
 
# Prevent "WARNING: Error loading config file: .dockercfg: $HOME is not defined" messages
export HOME=$SCRIPT_PATH
 
if ! [ -x "$DOCKER" ]; then
  echo "Docker binary not found; exiting." >&2
  exit -1
fi
 
log "Docker Ingress Routing Daemon $VERSION starting ..."
 
if detect_ingress; then
  log "Detected ingress subnet: $INGRESS_SUBNET" >&2
  log "This node's ingress network IP: $INGRESS_DEFAULT_GATEWAY" >&2
else
  log "Couldn't identify ingress network subnet or this node's ingress network IP; sleeping 1s, then exiting."
  sleep 1
  exit -1
fi
 
# Delete any relevant preexisting rules.
nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t nat -S | \
  grep -- '-m ipvs --ipvs -j ACCEPT' | \
  sed -r 's/^-A /-D /' | \
  while read RULE; \
  do
    log "Deleting old rule: iptables -t nat $RULE"
    nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t nat $RULE
  done
 
nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t mangle -S | \
  grep -- '-j TOS --set-tos' | \
  sed -r 's/^-A /-D /' | \
  while read RULE; \
  do
    log "Deleting old rule: iptables -t mangle $RULE"
    nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t mangle $RULE
  done
 
nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t raw -S | \
  grep -- '-j CT --notrack' | \
  sed -r 's/^-A /-D /' | \
  while read RULE; \
  do
    log "Deleting old rule: iptables -t raw $RULE"
    nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t raw $RULE
  done
 
if [ "$INSTALL" = "0" ]; then
  log "Docker Ingress Routing Daemon iptables rules uninstalled, exiting."
  exit 0
fi
 
###############
# $INSTALL is 1
#
 
if [ -z "$INGRESS_NODE_GATEWAY_IPS" ]; then
  INGRESS_NET=$(echo $INGRESS_DEFAULT_GATEWAY | cut -d'.' -f1,2,3)
  INGRESS_NODE_GATEWAY_IPS="$INGRESS_NET.2 $INGRESS_NET.3 $INGRESS_NET.4 $INGRESS_NET.5 $INGRESS_NET.6 $INGRESS_NET.7 $INGRESS_NET.8 $INGRESS_NET.9"
   
  log "!!! -------------------------- WARNING ------------------------------------"
  log "!!! Assuming --ingress-gateway-ips $INGRESS_NODE_GATEWAY_IPS"
  log "!!!"
  log "!!! Please compile a list of the ingress network IPs of each of your nodes"
  log "!!! that you will be using as a load-balancer."
  log "!!!"
  log "!!! You only have to do this once, or whenever you change your set of"
  log "!!! load-balancer nodes."
  log "!!!"
  log "!!! Then relaunch using:"
  log "!!! $0 --install --ingress-gateway-ips \"<Node Ingress IP List>\""
  log "!!! ----------------------------------------------------------------------"
 
fi
 
log "Running with --ingress-gateway-ips $INGRESS_NODE_GATEWAY_IPS"
 
# Create node ID from INGRESS_DEFAULT_GATEWAY final byte
NODE_ID=$(echo $INGRESS_DEFAULT_GATEWAY | cut -d'.' -f4)
log "This node's ID is: $NODE_ID"
 
# Add a rule ahead of the ingress network SNAT rule, that will cause the SNAT rule to be skipped.
if [ -z "$TCPServicePortString" ] && [ -z "$UDPServicePortString" ]; then
  log "Adding ingress_sbox iptables nat rule: iptables -t nat -I POSTROUTING -d $INGRESS_SUBNET -m ipvs --ipvs -j ACCEPT"
  nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t nat -I POSTROUTING -d $INGRESS_SUBNET -m ipvs --ipvs -j ACCEPT
 
  # 1. Set TOS to NODE_ID in all outgoing packets to INGRESS_SUBNET
  log "Adding ingress_sbox iptables mangle rule: iptables -t mangle -A POSTROUTING -d $INGRESS_SUBNET -j TOS --set-tos $NODE_ID/0xff"
  nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t mangle -A POSTROUTING -d $INGRESS_SUBNET -j TOS --set-tos $NODE_ID/0xff
 
  log "Adding ingress_sbox connection tracking disable rule: iptables -t raw -I PREROUTING -j CT --notrack"
  nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t raw -I PREROUTING -j CT --notrack
else
 
  if [ -n "$TCPServicePortString" ]; then
    log "Adding ingress_sbox iptables nat rule: iptables -t nat -I POSTROUTING -d $INGRESS_SUBNET -p tcp -m multiport --dports $TCPServicePortString -m ipvs --ipvs -j ACCEPT"
    nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t nat -I POSTROUTING -d $INGRESS_SUBNET -p tcp -m multiport --dports $TCPServicePortString -m ipvs --ipvs -j ACCEPT
 
    # 1. Set TOS to NODE_ID in all outgoing packets to INGRESS_SUBNET
    log "Adding ingress_sbox iptables mangle rule: iptables -t mangle -A POSTROUTING -d $INGRESS_SUBNET -p tcp -m multiport --dports $TCPServicePortString -j TOS --set-tos $NODE_ID/0xff"
    nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t mangle -A POSTROUTING -d $INGRESS_SUBNET -p tcp -m multiport --dports $TCPServicePortString -j TOS --set-tos $NODE_ID/0xff
 
    log "Adding ingress_sbox connection tracking disable rule: iptables -t raw -I PREROUTING -p tcp -m multiport --dports $TCPServicePortString -j CT --notrack"
    nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t raw -I PREROUTING -p tcp -m multiport --dports $TCPServicePortString -j CT --notrack
  fi
 
  if [ -n "$UDPServicePortString" ]; then
    log "Adding ingress_sbox iptables nat rule: iptables -t nat -I POSTROUTING -d $INGRESS_SUBNET -p udp -m multiport --dports $UDPServicePortString -m ipvs --ipvs -j ACCEPT"
    nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t nat -I POSTROUTING -d $INGRESS_SUBNET -p udp -m multiport --dports $UDPServicePortString -m ipvs --ipvs -j ACCEPT
 
    # 1. Set TOS to NODE_ID in all outgoing packets to INGRESS_SUBNET
    log "Adding ingress_sbox iptables mangle rule: iptables -t mangle -A POSTROUTING -d $INGRESS_SUBNET -p udp -m multiport --dports $UDPServicePortString -j TOS --set-tos $NODE_ID/0xff"
    nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t mangle -A POSTROUTING -d $INGRESS_SUBNET -p udp -m multiport --dports $UDPServicePortString -j TOS --set-tos $NODE_ID/0xff
 
    log "Adding ingress_sbox connection tracking disable rule: iptables -t raw -I PREROUTING -p udp -m multiport --dports $UDPServicePortString -j CT --notrack"
    nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t raw -I PREROUTING -p udp -m multiport --dports $UDPServicePortString -j CT --notrack
  fi
 
fi
 
if [ "$PERFORMANCE" != "0" ]; then
  # Set sysctl variables
  log "Setting ingress_sbox namespace sysctl variables net.ipv4.vs.conn_reuse_mode=0 net.ipv4.vs.expire_nodest_conn=1 net.ipv4.vs.expire_quiescent_template=1"
  nsenter --net=/var/run/docker/netns/ingress_sbox sysctl net.ipv4.vs.conn_reuse_mode=0 net.ipv4.vs.expire_nodest_conn=1 net.ipv4.vs.expire_quiescent_template=1
   
  log "Setting ingress_sbox namespace sysctl conntrack variables from /etc/sysctl.d/conntrack.conf"
  [ -f "/etc/sysctl.d/conntrack.conf" ] && nsenter --net=/var/run/docker/netns/ingress_sbox sysctl --load=/etc/sysctl.d/conntrack.conf
   
  log "Setting ingress_sbox namespace sysctl ipvs variables from /etc/sysctl.d/ipvs.conf"
  [ -f "/etc/sysctl.d/ipvs.conf" ] && nsenter --net=/var/run/docker/netns/ingress_sbox sysctl --load=/etc/sysctl.d/ipvs.conf
fi
 
log "Docker Ingress Routing Daemon launching docker event watcher in pgroup $$ ..."
 
# Turn off job control
set +m
 
# Set lastpipe, so that the 'while read' runs in the main shell process,
# making the script more resilient to subprocess exit.
shopt -s lastpipe
 
trap quit EXIT TERM INT
 
# Watch for container start events, and configure policy routing rules on each container
# to ensure return path traffic for incoming connections is routed back via the correct interface
# and to the correct node from which the incoming connection was received.
docker events \
  --format '{{.ID}} {{index .Actor.Attributes "com.docker.swarm.service.name"}}' \
  --filter 'event=start' \
  --filter 'type=container' | \
  while read ID SERVICE
  do
    if [ -z "$SERVICE" ]; then
      continue
    fi
 
    if [ ${#SERVICES[@]} -gt 0 ] && ! [[ " ${SERVICES[@]} " =~ " $SERVICE " ]]; then
      log "Container SERVICE=$SERVICE, ID=$ID launched: unmatched service, so skipping."
      continue
    fi
 
    NID=$(docker inspect -f '{{.State.Pid}}' $ID)
    CIF=$(nsenter -n -t $NID ip -brief addr show to $INGRESS_SUBNET | cut -d'@' -f1)
 
    if [ -z "$CIF" ]; then
      log "Container SERVICE=$SERVICE, ID=$ID, NID=$NID launched: no ingress network interface found, so skipping applying policy routes."
      continue
    fi
 
    log "Container SERVICE=$SERVICE, ID=$ID, NID=$NID launched: ingress network interface $CIF found, so applying policy routes."
      
    # 3. Map any connection mark on outgoing tcp or udp traffic to a firewall mark on the individual packets.
    #    These rules /could potentially/ be applied more selectively, according to --tcp-ports and --udp-ports, to make
    #    a marginal efficiency gain, but this is not necessary: as, if no connection mark has been set, because no
    #    TOS byte has been set by the load balancer, then none will be restored and legacy routing rules will apply.
    #    - See https://github.com/newsnowlabs/docker-ingress-routing-daemon/issues/11
    nsenter -n -t $NID iptables -t mangle -A OUTPUT -p udp -j CONNMARK --restore-mark
    nsenter -n -t $NID iptables -t mangle -A OUTPUT -p tcp -j CONNMARK --restore-mark
 
    # 3.1 Enable 'loose' rp_filter mode on interface $CIF (and 'all' as required by kernel
    #     see https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt)
    nsenter -n -t $NID sysctl net.ipv4.conf.all.rp_filter=2 net.ipv4.conf.$CIF.rp_filter=2
 
    for NODE_IP in $INGRESS_NODE_GATEWAY_IPS
    do
      NODE_ID=$(echo $NODE_IP | cut -d'.' -f4)
   
      # 2. Map the TOS value on any incoming packets to a connection mark, using the same value.
      nsenter -n -t $NID iptables -t mangle -A PREROUTING -m tos --tos $NODE_ID/0xff -j CONNMARK --set-xmark $NODE_ID/0xffffffff
   
      # 4. Select the correct routing table to use, according to the firewall mark on the outgoing packet.
      nsenter -n -t $NID ip rule add from $INGRESS_SUBNET fwmark $NODE_ID lookup $NODE_ID prio 32700
   
      # 5. Route outgoing traffic to the correct node's ingress network IP, according to its firewall mark
      #    (which in turn came from its connection mark, its TOS value, and ultimately its IP).
      nsenter -n -t $NID ip route add table $NODE_ID default via $NODE_IP
   
    done
  done
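
The script derives each node's ID from the final byte of its ingress IP and reuses that number as the TOS value, the connection mark, the firewall mark, and the routing table number. One consequence, as far as the code shows, is that every address passed to --ingress-gateway-ips needs a distinct final byte. The sketch below mirrors that derivation and flags collisions; `node_id_for_ip` and `check_unique_node_ids` are hypothetical helper names and are not part of the script.

```shell
# Final byte of an IPv4 address, equivalent to the script's cut -d'.' -f4.
node_id_for_ip() {
  echo "${1##*.}"
}

# Print the ID for each IP; return 1 if two IPs map to the same ID.
check_unique_node_ids() {
  local ip id seen=" " status=0
  for ip in "$@"; do
    id=$(node_id_for_ip "$ip")
    case "$seen" in
      *" $id "*) echo "collision: $ip reuses node ID $id" >&2; status=1 ;;
      *) seen="$seen$id " ;;
    esac
    echo "$ip -> TOS/mark/table $id"
  done
  return $status
}

check_unique_node_ids 10.255.0.2 10.255.0.3 10.255.0.4
```

In a /16 ingress subnet two nodes could in principle receive addresses such as 10.255.0.2 and 10.255.1.2, which this check would report as a collision.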