Birkhoff-von Neumann Crossbar 光交换网络的调度方案

2024-02-29 15:15:28

Birkhoff-von Neumann Crossbar 光交换网络的调度方案

This is a summary aimed at looking for "high performance novel scheduling algorithm for fast optical switch in data center network"

主要基于 [1~6]，也包含一些对 optical data center network 相关技术的理解，并且拿了一个自以为蛮漂亮的调度算法做为了例子( •̀ ω •́ )✧，ゲームをする！

Network Architecture and Optical Switching Technology

这里首先对 optical data center network 的目标优势与挑战进行一些概述 [1]，而后介绍这一技术中占据较重要地位的光交换机 (我有时也很随意地直接叫它光开关) 技术

Demands of Optical Data Center Networks

光数据中心网络 (optical data center networks, optical DCNs) 是被用以符合高容量，低延迟，细粒度的 DCN 需求的一种架构，它有望弥补下面列举的几条传统电子方案的缺陷 [1.1]

In hierarchical tree-like topology based on electronic switch, the scaling issues of ball grid array (BGA) package causing bandwidth limited by application-specific integrated circuits (ASICs) input/output (I/O) bandwidth

Electronic switch achieves higher bandwidth by stacking ASICs in multitier structure, which leads to larger latency and higher cost;

Electronic circuit-based solution is facing a difficulties when large interconnectivity and fast reconfiguration are required;

Storing and transmitting properties of electronic switch have significant power dissipation, and the optical/electronic/optical (O/E/O) conversions and format-dependent interfaces will perform lower power/cost efficiency.

这就成为了故事的开端，可以发现这些困境大体是电子器件的一些内禀性质 (fermion... ... 好麻烦... ...) 所导致的，有一些甚至是不可避免的

另一方面，从下面的表格就可以看出光子学解决方案的“显著”优势，它几乎覆盖了人们对于未来 DCN 的所有幻想，这样，各种光交换机与一些调度算法或许能在这里拥有自己的一片天空... ...

Requirements	Optical Solution
Capacity <bandwidth for more inter-rack & intra-cluster communication>	<wavelength division multiplex (WDM) to provide power per unit bandwidth>
Latency <switch latency from buffering & routing algorithm & arbitration>	<optical packet/burst switching (OBS, OPS) based on fast optical circuit switching (OCS)>
Interconnectivity <support concurrent flows & bandwidth ultilization & service delivery>	<(high-capacity) optical interconnection technology>
Scalability <scale nodes number for future capacity & cost>	<photonic integration may provide scaling without loss of capacity and power/cost>
Flexibility <manage servise delivery & adapt to changing needs>	<software-defined networking (SDN) is used to facilitate highly flexible provisioning>
Power/Cost Efficiency <reduce cost & scale bandwidth>	<data traffic in optical domain, no O/E/O power consumption>

顺带一提，里面涉及到的 OPS/OBS，光分组交换/光突发交换技术能借由突发流的统计复用实现亚波长的带宽粒度，其中，OPS 网络在相同信道中同时/先后传输包头 (package header) 与数据包 (package, $\it10^2\rm\ ns\sim\it10^2\rm\ ms$)，不需要提前预留连接，也有很好的 bandwidth efficiency，适合于 DC 较小数据量的传输，OBS 则借助于大量的突发 (burst, $\it10\sim10^3\rm\ ms$) 与先于其在 (物理上的) 分离信道中发送的块控制数据头 (block control header, BCH) 实现资源预留，保留 OPS 的 flexibility 以及 efficiency 的同时降低了对光交换机和光缓存的要求，因为一般光交换机的 reconfiguration time 需要远小于包/突发的时长

作为 optical DCNs 架构的实例，可以参考 Opsquare: A flat DCN architecture based on flow-controlled optical packet switches [4] 以及 Torus-topology Data Center Network based on Optical Packet/Agile Circuit Switching with Intelligent Flow Management [5] 这两种基于 OPS 的架构，它们在当前 ($\it2020\pm$) 各种架构中拥有相对较优的性能，都有将 electronic buffer (EB) 设置在网络连接部分，从而起到在冲突中重传数据包或等待调度指令的作用。两个网络分别是基于 OPS 与 OCS 的 Torus (同时使用了 Deflection, ODL 和 EB) 以及基于 OPS 的 OPSquare，顺带一提，它们在 scalability, flexibility 以及 power/cost 方面也都有不错的表现

Optical Switches

由上所述，optical DCNs 的性能很大程度上受制于光交换机的性能。下面是一些比较常用，且有可能 (有些已经具备了商业化的能力甚至已经进入市场) 被用于集成的光交换机，它们的 switching time 基本都控制在 $\it1\rm\ ns\sim\it10\rm\ \mu s$ 的范围内 [1.1]

Switch Type	Switching Time	Scale	Loss
Mach-Zehnder interferometer (MZI, thermo-optic)	$\sim\it10\rm\ \mu s$	$\it32\times32$	High
micro-ring resonator (MRR, thermo-optic)	$\sim\it10\rm\ \mu s$	$\it8\times8$	Fair
PLZT MZI	$\sim\it10\rm\ ns$	$\it8\times8$	High
InP MZI	$\sim\it2.5\rm\ ns$	$\it8\times8$	High
LiNbO₃ MZI	$\sim\it1\rm\ ns$	$\it32\times32$	High
MZI + electro-absorption modulator (EAM)	$<\it10\rm\ ns$	$\it8\times8$	Fair
tunable wavelength converters (TWCs) + arrayed waveguide grating router (AWGR)	$\sim\it10s\rm\ ns$	$\it10s\times10s$	Fair
tunable lasers (TLs) + AWGR	$\sim\it10s\rm\ ns$	$\it100s\times100s$	Low
semiconductor optical amplifier (SOA) multistage	$<\it10\rm\ ns$	$\it16\times16$	Fair

稍慢一些的光交换机如基于微机电系统 (microelectromechanical system, MEMS) 的光交换机没有列在上面，这些光交换机大体可以 (依据其构造原理) 分为如下几类 [1.2]

Microelectromechanical system optical switch (基于控制微镜的运动控制，大体分为利用反射机制改变光的传播方向以及利用干涉衍射机制改变光的相位两类，前者易于进行批量制造)

特征数值 $T=50\rm\ ms,IL=3.5\ dB$；

Beam-steering-based optical switch (通过 2D piezoelectric actuators 将束流引导至对应的 $\varphi,\theta$ 位置，并由 integrated position sensor 探测光束指导反馈对束流的优化，具备软件定义网络 (software-defined networking, SDN) 的特性)

Liquid crystal optical switch (利用电压控制液晶 (liquid crystal, LC) 中分子取向实现极化或折射率的变更，有些设计涉及到了双折射及 Liquid crystal on silicon (LCoS) 的技术，受限于 polarization-dependent loss (PDL) 以及较高的串扰)

Electro-optic optical switch (基于电光效应 (electro-optic effect, EO effect，外加电场改变材料折射率，具备 $\rm ns$ 级的超快响应)，结合相位调制和干涉机制设计，采用 LiNbO₃，PLZT 等材料，较高的插损限制了 scalability)

Semiconductor optical amplifier-based switch (这个原理类似于一种开关器件，有外加电流的情况下，入射光就成功引发受激辐射光放大，没有的话也就没有了出射光，之后通过 broadcast-and-select 架构将其排列为光交换机的阵列，但是批量生产时这种设计需要 $N^2$ 个 SOA)

Thermo-optic optical switch (电流控制电极温度，进而影响材料的折射率，这样就能制造一些耦合器件作为开关使用，但是需要一些热光系数高且热导率低的材料，其优势在于不依赖于波长与偏振)

下面表格对这几类光交换机的性能给出了更加详细的评价 [1.2]

Characteristics	MEMS based	Beam-Steering	Liquid crystal	Electro-optic	SOA based	Thermo-optic
Switching speed	$\it10\sim20\rm\ ms$	$\it25\rm\ ms$	$\it100\rm\ ms$	$\mathbf{\sim\ ns}$	$\mathbf{\sim\ ns}$	$\it5\sim10\rm\ ms$
Insertion loss	Medium	Low	High	High	Low	Low
Power efficiency	Medium	Low	High	High	Low	Low
Scalability	Large ($\it320\times320$)	Large ($\it384\times384$)	Medium ($\it1\times20$)	Medium ($\it32\times32$)	Medium ($\it16\times16$)	Small ($\it8\times8$)
Reliability	Low	Low	High	High	Medium	High
Implementation cost	Medium	Medium	Low	High	High	Low

Challenges of Optical DCN Solution

由于当前缺少有效的光记忆器件，DC 中重要的冲突解决 (contention resolution) 功能不易实现，作为解决方案，fixed fiber delay line (FDL) 可以在时域暂存冲突包/突发，wavelength conversion 方法能够在频域将冲突转发到其它信道，deflection routing 提供在空域将冲突转发到其它端口的方案，但是它们都不可避免地引入了路由控制 (routing control) 以及包同步的问题。另一方面，未来 DCN 对于光交换机的高速切换以及一些相应的系统控制机制提出了更高的要求，当前的 optical DCNs 在 scaling 方面也受制于光交换机的端口数目，要想在这方面既避免使用层次结构，又保留不错的缓冲控制机制，或许得期待 photonic integration 的后续发展了

这些问题的解决方案更多来自于硬件方面如光器件的性能，但也并非不能考虑从一些调度算法 (scheduling algorithm) 的软件方面入手寻找思路，关于这一方面，可以直接转到后面的 scheduling algorithm，那里会有一些介绍性的内容，如果想看一个比较完整的算法，请移步 crossbar switches as example (●'◡'●)

Switching Architectures and Scheduling Algorithms

下面将重点放在了关于高速交换技术的一些基础知识，当然也涉及到了一些相关的调度算法

Basic Knowledge on Packet Switching

设计一个交换网络 (基于转发表 (forwarding table)，注意交换和路由是不同的，前者是对于特定网络节点而言，而后者关注大规模网络中节点间的信息传递)，需要考虑“输出冲突 (output contention)”、“出入端口连接的建立速度 (最好能够具备 self-routing 的能力)”的问题。在这里，首先我们需要一些定义来描述冲突，关于这点，有内部阻塞以及输出冲突两种类型，分别称作 blocking 和 output contention，与此同时，一些无阻塞的交换网络也是存在的，如 crossbar switch

在以上基础上，定义描述交换网络性能的两个物理量 throughput $X\propto\mathbb{E}[C_{\rm input}]/\mathbb{E}[\tau_{\rm output}]$ 以及 speedup $S=T_{\rm input}/T_{\rm forwarding}$，更高的 speedup 说明网络具备更强的冲突化解能力，对应着更高的 throughput

交换架构可以分为如下几类：

Packet switches
├── Time division switching (TDS)
│	├── Shared medium
│	└── Shared memory
└── Space division switching (SDS)
	├── Single path
	│	├── Crossbar
	│	├── Fully interconnected
	│	└── Banyan
	└── Multipath
		├── Augmented banyan
		├── Clos
		├── Multiplane
		└── Recirculation

其中，TDS 需要内部的通信带宽大于所有输入端口聚集的带宽，但能够轻松支持多播/广播，shared memory type 相比于 shared medium type 有更高的内存利用率但需要双倍内存速度；SDS 中，crossbar (更多信息见后面的 Crossbar Switches as Example) 是 fully interconnected type 的拓扑学等价，二者具备差不多的特性 (如 $N^2$ 复杂度和非阻塞结构)；而 banyan-based 交换机 (包含 Delta-Omega-Banyan 这三类同构的拓扑) 中，每条信息需经 $\log N$ 的节点以到达终点；Multipath 相比于 single path 具备了更强的容错能力，其中的 multiplane type，recirculation type 还能增大 throughput，但 recirculation 有时会导致信息的顺序错乱

数据缓冲策略有如下几类：

Buffering strategy
├── Shared-memory queuing
├── Output queuing (OQ)
├── Input queuing
├── Virtual output queuing (VOQ)
└── Combined input and output queuing (CIOQ)
.(Crosspoint queuing? place incoming cells in crosspoing buffer (XB))

其中，shared-memory queuing structure 通过输出端对于内存的共享实现了内存的最大利用率，但面临尺寸的限制；而 OQ 在以上缺点的基础上，又失去了内存利用率的优势 ε=ε=ε=┏(゜ロ゜;)┛，在服务质量 (Quality of Service, QoS) 的控制上有了一些提高；相比于这两个， input queuing 不受限于尺寸，但带来了线段阻塞 (head-of-line blocking, HOL blocking，尽管输出端口空闲，数据却被堵在了输入队列) 的问题，这使得它的 throughput 稍低一些 ($\it58.6\%$)；为解决 HOL blocking，VOQ 中的每个输入缓冲被分成 $N$ 份，这个可能是当下的研究热点 ($\it 2020\pm$)，其调度算法比前几个要复杂；CIOQ 则是另一种用于解决 HOL 的策略，它通过交换结构的 speedup 换取更高的 throughput

大规模 (多模块) 交换结构分为如下两种类型

Large-scale switch
├── Single-stage switch
│	└── Parallel packet switch (PPS)
└── Multistage switch
	├── Partially connected multistage network
	│	└── Banyan-based switch
	└── Fully connected multistage network
		└── Clos-network switch

其中，PPS 有最大的 throughput，其性能随 speedup 的大小而变化，但是 single-stage 难免无法支持庞大的端口数目；banyan-based network 和 Clos network 作为 multistage 则避免了上述缺陷，banyan network ($N\times N$ network, $\log_bN$ stages, $b\times b$ switching elements) 也支持 self-routing，但会带来内部阻塞，交换机数目的增大将导致其性能骤减；Clos network ($N\times N$ network, $\it3$ stages, $n$ input lines, $\it m\geq2n-1$ middle-stage switch modules, maximum number of crosspoints $N_x=O(N^{\it3/2})$) 尽管内部全连通，但需要相应 (复杂) 的调整才可使网络具备非阻塞性质，但即便如此也无法避免内部阻塞的发生，解决方案一般只能是粗暴地增加内部连接数目和带宽了... ...

Performance Analysis

了解了一些基本交换结构后，下面利用一点基础的排队论知识给出 input-buffered switches, output-buffered switches 以及 completely shared-buffer switches 这三种交换策略的性能分析 [2]

注意，这里采用的交通源模型是上图所示的几何分布流，于是负载 $\rho$ 就是每个时间间隔内信源处于激活状态的占比

\[\begin{equation} \begin{split} \rho=\frac{\beta=E[X]}{\beta=E[X]+\alpha=E[Y]}=\frac{\sum_{i=1}^\infty i{\rm Pr}\{X=i\}}{\sum_{i=1}^\infty i{\rm Pr}\{X=i\}+\sum_{j=1}^\infty j{\rm Pr}\{Y=j\}}=\frac{q}{q+p-pq}, \end{split} \end{equation} \]

Input-Buffered Switches

在有 HOL blocking 存在的情形下，总是会有信息传输延迟，因而这种方案无法获得 $\it100\%$ throughput，假若将信源定为无记忆型 (Markov)，每个端口就相当于一个 $\it M/D/1$ 排队模型，稳态下应有每段时间内传输的信息单元数 $\bar{F}$ 与目标为 $i$ 端口的，每段时间后仍滞留的信息单元数 $\bar{B}^i$ 满足如下关系

\[\begin{equation} \begin{split} \bar{F}=\sum_{i=1}^N\bar{B}^i, \end{split} \end{equation} \]

又注意到 $\bar{B}^i=\rho_0^2/[2(1-\rho_0)]$，$\rho_0$ 是归一化 throughput，可以由上式得到 $\bar{B}^i$ 的另一种表示方式

\[\begin{equation} \begin{split} \bar{B}^i=\frac1N\sum_{i=1}^N\bar{B}^i=1-\frac{\bar{F}}{N}=1-\rho_0=\frac{\rho_0^2}{2(1-\rho_0)}, \end{split} \end{equation} \]

由此可得 $\rho_0=(2-\sqrt2)\approx0.586$，这也就是 input-buffered switches 在这种信源下的渐进吞吐量了，换句话说，一旦输入速率大于这个值，系统就会饱和

下面借由基于上述几何分布的 $\it Geom/G/1$ 模型考察这种调度方式的等待时间，但是需要首先接受“到达每个端口的信息单元遵循独同分布以及拥有相同概率 $p$”以及“每个单位有相同几率选取任意输出端口为目标”这两个假设，再借助 $\it Geom/G/1$ 模型的结论，就可以知道平均等待时间为 (用到了服务时间 $S$，对于每段时间而言是一个随机变量)

\[\begin{equation} \begin{split} \overline{W}=\frac{p\overline{S(S-1)}}{2(1-p\bar{S})}+\bar{S}-1, \end{split} \end{equation} \]

显然，它的渐进形式还是蛮可怜的，随着 $N\to\infty$，它也以一个很快的速率趋近于 $\infty$ [2]

Output-Buffered Switches

这里 $N\to\infty$ 将导致信息单元到达输出端口的数目遵循指数分布 $p^ke^{-p}/k!$，其中的 $p$ 即为输入负载，由此可以立刻得到所谓信息单元丢失率 (若服务率/吞吐量无法满足输入负载的需求，则在输出端口的信息累积会导致溢出 output buffer 而丢失) 为 $1-\rho_0/p$，举个例子，对于 $\it80\%$ 的负载，需要 buffer 大小大于 $\it28$ 才能保证几乎无丢失

我们可以运用 little's law 来求得平均等待时间，也即，平均到达数 $\bar{Q}$ 等于平均等待时间 $\overline{W}$ 乘以服务率 $\rho_0$，对于 $N\to\infty,b\to\infty$，有

\[\begin{equation} \begin{split} \overline{W}=\frac{p}{2(1-p)}, \end{split} \end{equation} \]

Completely Shared-Buffered Switches

这里的所有信息单元都储存在同一个 buffer 中，也就是说输入与输出共享 buffer，它的性能显然会稍高于前两种方案，但是似乎只能靠一些数值模拟的方式获知它的具体的表现，与 output-buffered switches 不同的是，其丢失率随 buffer 大小变化呈现非线性 (buffer 越大，丢失率以越大的斜率迅速减小)，而前者是呈 (近似) 线性的恒定速率减小的，也就是说这里可以使用较小的 buffer 达到 $\it100\%$ throughput

Scheduling Algorithms

刚刚已经多次提到调度算法的由来了，简单概括一下，也就是说

首先由于日益增加的，对于高线速，高端口数的大规模交换机的需求，人们不再满足于 output buffer 的内存高利用率，因为其不具备这种 scalability ；

但是一旦开始使用 input buffer 或 combined-input-output-buffered 策略，就不可避免地需要一些增大 throughput 的方法 (提高交换机速度或增大路径数以解决 HOL blocking + 仲裁机制以解决冲突问题)；

应运而生的就是这些调度算法和路由技术了... ...

一个有不错性能的调度算法应具备如下特性

Efficiency，有高 throughput 同时也能有较低的 delay，对每个时段选取最多边的二部图匹配；

Fairness，能够避免单个 VOQ 的过度调用；

Stability，每个 VOQ 的占有数目都为有限值；

Implementation complexity，易于在硬件以较高速度实现

当然这里有很多算法都能符合要求，于是这里就直截了当地介绍一些适用于 input-buffered switches 的调度算法，也就是下面这些，它们不是全部，而且我也不会把每个都解释清楚，总之先对这几种算法有个大致印象叭...

二部图最大权匹配 (maximum weight matching，MWM, $O(N^3)$)，是我觉得最直接的一种方法，也即

\[\begin{equation} \begin{split} \max_{w_{ij}}\sum_{e_{(i,j)\in M}}w_{ij}, \end{split} \end{equation} \]

包含如 LQF (longest queue first, $w_{ij}(t)\leftarrow L_{ij}(t)$)，OCF (oldest cell first, $w_{ij}(t)\leftarrow \tau_{ij_{\rm HOL}}(t)$) 和 LPF (longest port first, $w_{ij}(t)\leftarrow\delta[L_{ij}(t)]\times[R_{i}(t)+C_{j}(t)]$) 的一系列算法

由它产生的有近似最大权 (1-APRX，牺牲了一些稳定性以换取低一些的复杂度) 和最大尺寸匹配 (maximum size matching, MSM, $O(N^{2.5})$)，是当边权为 $\it1$ 时的 MWM

极大匹配 (maximal matching)，对于 speedup $N$，总是可以由最多 $N$ 次迭代求出匹配结果，复杂度一般比较大

在其中已经有了如 parallel iterative matching (PIM), iterative round-robin matching (iRRM), iterative round-robin with SLIP (iSLIP), FIRM, dual round-robin matching (DRRM), pipelined maximal matching 和 exhaustive service dual round-robin matching (EDRRM) 等算法

随机匹配 (randomized matching)，尽管不一定一直选到最好的，但却有很高的稳定性和线性/对数复杂度

包含如 De-randomized, SERENA, HE-$i$SLIP 等算法

帧匹配 (frame-based matching)，为适应光交换的环境，采用多个 cells 合成单帧进行调度以提高效率，性能高开销低

包含如 reducing the reconfiguration frequency (exact covering, minimum switching, double), fixed-size frame-based matching (frame-based MWM, frame-based maximal weight matching, frame-based multiple iteration weight matching) 和 asynchronous variable-size frame-based matching (EDRRM) 等匹配算法

Speedup 下的稳定匹配，能够保证已匹配的端口具有足够高的优先级

包含如 most-urgent-cell-first algorithm (MUCFA), critical cell first (CCF), last-in-highest-priority (LIHP) 和 lowest-output-occupancy-cell-first (LOOFA) 等匹配算法

上面涉及到的符号都是网络科学中的惯常了，应该没啥理解偏差了，在满足一定条件的情形下，上述的各种调度算法都可以达到 $\it100\%$ 的 throughput

Crossbar Switches as Example

下面首先介绍一下 input buffered crossbar switches [3] 的基本结构，然后以一种适用于其的调度算法 (这种算法基于双随机矩阵的 Birkhoff-von Neumann 定理，且不需要 framing 以及 internal speedup) 作为例子来熟悉这类交换结构的功能

如上图所示 (是一个 $\it4\times4$ 的 crossbar 的结构示意图，以及它的基本冲突形式，这种结构显然很简单，且易于对其进行模块化与控制)，所谓 crossbar 就是由 $N\times N$ 个全交叉的连接组成的一个阵列 (i.e.，共 $\it2N$ 个 bars，一种 single path 的 SDS)，它们之间的连接在每一个时间都可以被一个置换矩阵 (permutation matrix) $P_k$ 描述，从 $i$ 端口到 $j$ 端口的 requested rate 为 $R=(r_{ij})$ 的矩阵元，若保证网络中没有超额订阅 (overbooking)，则须使非负矩阵 $R$ 为双随机的 (doubly stochastic) $$ \begin{equation} \begin{split} \left\{ \begin{split} \sum_{i=1}^Nr_{ij}\leq1,\ \ \forall j,\\ \sum_{j=1}^Nr_{ij}\leq1,\ \ \forall i, \end{split} \right. \end{split} \end{equation} $$ 可以得知，只要能够得到一组正数 (作为权值) $\phi_k$ (当然，满足 $\sum_{k=1}^K\phi_k=1$ 且 $\it K\leq N^2-2N+2$) 与 $P_k$ 的组合，使得以下分解不等式成立，即可直接将每一个 $P_k$ 设置为与其对应的 $\phi_k$ 成正比的形式 $$ \begin{equation} \begin{split} R\leq\sum_{k=1}^K\phi_kP_k, \end{split} \end{equation} $$

这种双随机矩阵的分解刚好对应着所谓 Birkhoff-von Neumann 算法，其正规表述见 付録┍1┙，分解步骤可以伪代码的形式表示如下，注意，这里涉及到共两个步骤，首先是寻找更大元素的双随机矩阵 (von Neumann)，而后对这个双随机矩阵进行分解，获得对应的正权与置换矩阵乘积之和的形式 (Birkhoff)，它们的复杂度都在下面标示出来了

$\mathbf{Algorithm}$:$\mathbf{\ Birkhoff}$-$\mathbf{von\ Neumann}$
$\mathbf{Input}$: the double stochastic matrix ${\color{#ff6347}{\boldsymbol{R}}}=[r_{ij}]$
$\mathbf{Output}$: the set/vector of positive numbers and permutation matrices $\{{\color{#c71585}{\boldsymbol{\Phi}}}=[{\color{#c71585}{\boldsymbol{\phi}}}_k],{\color{#c71585}{\boldsymbol{\mathscr{P}}}}=[{\color{#c71585}{\boldsymbol{P}}}_k]\}$

// find doubly stochastic matrix $\tilde{{\color{#ff8c00}{\boldsymbol{R}}}}=[\tilde{r}_{ij}],{\text{s.t.}}\forall i,j,\tilde{r}_{ij}\geq{r}_{ij}$, with total complexity of $O(N^3)$
$\tilde{{\color{#ff8c00}{\boldsymbol{R}}}}\leftarrow{\color{#ff6347}{\boldsymbol{R}}}$;
// with complexity of $O(2N-1)\sim O(N)$
$\color{#3cb371}{\mathbf{REPEAT}}$
| // with complexity of $O(N^2)$
| $[i,j]\leftarrow{\color{#696969}{\mathbf{find}}}(\tilde{{\color{#ff8c00}{\boldsymbol{R}}}},{\color{#696969}{\mathbf{bind}}}(\sum_nr_{in}<1,\sum_mr_{mj}<1))$;
| $\epsilon\leftarrow1-\max[\sum_nr_{in},\sum_mr_{mj}]$;
| $r_{ij}\leftarrow\epsilon+r_{ij}$;
| ${\color{#696969}{\mathbf{update}}}(\tilde{{\color{#ff8c00}{\boldsymbol{R}}}})$;
$\color{#3cb371}{\mathbf{UNTIL}}$ $\sum_{n,m}r_{nm}=N$;

// find set of positive numbers ${\color{#c71585}{\boldsymbol{\phi}}}_k>0$ with permutation matrices ${\color{#c71585}{\boldsymbol{P}}}_k,{\rm\ st.\ }\tilde{{\color{#ff8c00}{\boldsymbol{R}}}}=\sum_k{\color{#c71585}{\boldsymbol{\phi}}}_k{\color{#c71585}{\boldsymbol{P}}}_k$, with total complexity of $O(N^{4.5})$, which further means $\sum_{k=1}^K{\color{#c71585}{\boldsymbol{\phi}}}_k=1$, doubly stochastic matrices are within the convex hull of permutation matrices
$[i_1,i_2,\cdots,i_N]\leftarrow{\color{#696969}{\mathbf{find}}}(\tilde{{\color{#ff8c00}{\boldsymbol{R}}}},{\color{#696969}{\mathbf{bind}}}(\prod_{k=1}^N\tilde{r}_{ki_k}>0))$;
$\phi'\leftarrow\min_{1\leq k\leq N}[\tilde{r}_{ki_k}]$;
// the permutation matrix of $[i_1,i_2,\cdots,i_N]$
$s_{perm}\leftarrow{\color{#696969}{\mathbf{to\_perm}}}([i_1,i_2,\cdots,i_N])$;
$R'\leftarrow\tilde{{\color{#ff8c00}{\boldsymbol{R}}}}-\phi's_{perm}$;
// with complexity of $O(N^2-2N+2)\sim O(N^2)$
$\color{#2e8b57}{\mathbf{FOR}}$ ${\color{#696969}{\mathbf{true}}}$
| $\color{#6b8e23}{\mathbf{IF}}$ $\phi'<1$
| | // with complexity improved to $O(N^{2.5})$
| | $[i_1,i_2,\cdots,i_N]\leftarrow{\color{#696969}{\mathbf{find}}}(R',{\color{#696969}{\mathbf{bind}}}(\prod_{k=1}^N\tilde{r}_{ki_k}>0))$;
| | $R'\leftarrow R'/(1-\phi')$;
| | $\phi'\leftarrow\min_{1\leq k\leq N}[\tilde{r}_{ki_k}]$;
| | $\mathbf{s}_{perm}\leftarrow{\color{#696969}{\mathbf{to\_perm}}}([i_1,i_2,\cdots,i_N])$;
| | $R'\leftarrow R'-\phi's_{perm}$;
| $\color{#6b8e23}{\mathbf{ELSE}}$
| | $\color{#6b8e23}{\mathbf{BREAK}}$;
| $\color{#6b8e23}{\mathbf{END}}$
| $[{\color{#c71585}{\boldsymbol{\Phi}}},{\color{#c71585}{\boldsymbol{\mathscr{P}}}}].{\color{#696969}{\mathbf{emplaced\_back}}}[\phi',R']$;
$\color{#2e8b57}{\mathbf{END}}$

里面的 ${\color{#2e8b57}{\mathbf{FOR}}}$-${\color{#6b8e23}{\mathbf{IF}}}$ loop 也可以直接表示为一个带条件的 ${\color{#2e8b57}{\mathbf{FOR}}}$ loop，只是我自己觉得这样写会稍微清楚一些。这样子得到了分解后，就可以进一步在每一轮仲裁中，都以每一个分解后的置换矩阵对应的正数的倒数作为此交换的虚完成时间 $F_k^\ell=1/\phi_k$，作为一种方案，这里将所有 $F_k^\ell,k=1,\cdots,K$ 从小到大排序，每次选最小的那个作为这轮的连接方式，若这一轮选了 $F_k^\ell$，那么下一轮的 $F_k^{\ell+1}=F_k^\ell+1/\phi_k$ 就好了，这个复杂度显然可以由 $K\leq N^2-2N+2$ 看出，是 $O(\log N)$ (这种方法算是 packetized generalized processor sharing, PGPS 算法的简化版本)

为了得到一种既考虑 efficiency 又顾及 fairness 的调度方案，上面用到的速率矩阵 $R$ 需要仔细地进行测量，这显然是一个优化问题 (resource sharing problem)，首先记 guaranteed-rate services 为 $R^g=(r_{ij}^g)$，再以下面的 online measurement 方法迭代求得第 $n$ 次测量周期下的 demand rate $r_{ij}^e(n)$

\[\begin{equation} \begin{split} &r_{ij}^e(n+2)=(1-\alpha(n))r_{ij}^e(n+1)+\alpha(n)\bigg[\frac{A_{ij}(nT)-A_{ij}((n-1)T)}T\bigg],\\ &r_{ij}^e(0)=r_{ij}^e(1)=\frac1N, \end{split} \end{equation} \]

里面的 $\alpha(n)$ 算是可调整占比的常数，如果网络流变化很迅速的话，适当调高 $\alpha(n)$ 有利于调度算法快速适应变化，$T$ 为测量周期，$A_{ij}$ 为累计到达量。可以看出，我们所需要的最优解必须介于 demand rate $r_{ij}^e$ 与 guaranteed-rate $r_{ij}^g$ 之间，把它当作一种约束，这就转化为一个对于效益函数 $U(\cdot)$ 的优化问题

\[\begin{equation} \begin{split} &\max\ \ \ \ \ \ \ &U(r_{ij},i,j=1,\cdots,N)\sim\sum_{i=1}^Nr_{ij}\sim\text{fairness},\\ &\text{s.t.}& \left\{ \begin{split} &r_{ij}^g\leq r_{ij}\leq r_{ij}^e,\ \ &i,j=1,2,\cdots,N,\\ &\sum_{i=1}^Nr_{ij}\leq1,&j=1,2,\cdots,N,\\ &\sum_{j=1}^Nr_{ij}\leq1,&i=1,2,\cdots,N, \end{split} \right. \end{split} \end{equation} \]

效益函数的不同取值就代表了 efficiency 与 fairness，前者对应一个最大流问题，而后者可以用水漫法 (water filling) 进行求解，这就完成了整个调度算法的流程，据仿真结果显示 [3] 它能够达到 $\it100\%$ 的 throughput，只是平均队列长度要稍微高于 output buffered switch 的模拟值

这种 crossbar 结构也可以通过多层网络的连接而获得 scalability，基本思路就是将每一个 crossbar 模块看作一个单元，用网络拓扑将它们连接起来，例如可以用 Banyan 拓扑将 $\it4\times4$ 的 Birkhoff-von Neumann crossbar switches连接成一个二层网络，然而这种结构具有三条关于 no overbooking 的约束 (前面只有两个)，其速率受到限制，为此可以再加一层网络，这相当于使用十二个 $\it4\times4$ 的置换矩阵来表示一个 $\it16\times16$ 的置换矩阵，在这里的调度算法就需要先求解优化问题得到 allocated rate matrix $R$，只不过这个 $R$ 应该是一个 $\it16\times16$ 的矩阵，对它作分解得到 $\it16\times16$ 的 $K$ 个置换矩阵们，再以 Slepian-Duguid 算法将它们分解为 $\it12K$ 个 $\it4\times4$ 的置换矩阵，将它们反过来合成 $\it1\sim12$ crossbar 对应的 $R$ 阵，最后用 Birkhoff-von Neumann 进行分解求得调度权值进行调度即可

[0.0] 面号：[OPTSxca75]

[1] Optical switching in next generation data centers[M]. Springer, 2017.

[1.1] Calabretta N, Miao W. Optical switching in data centers: Architectures based on optical packet/burst switching[M]//Optical Switching in Next Generation Data Centers. Springer, Cham, 2018: 45-69.

[1.2] Huang Q. Commercial Optical Switches[M]//Optical Switching in Next Generation Data Centers. Springer, Cham, 2018: 203-219.

[2] Chao H J, Liu B. High performance switches and routers[M]. John Wiley & Sons, 2007.

[3] Chang C S, Chen W J, Huang H Y. Birkhoff-von Neumann input buffered crossbar switches[C]//Proceedings IEEE INFOCOM 2000. Conference on Computer Communications. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies (Cat. No. 00CH37064). IEEE, 2000, 3: 1614-1623.

[4] Yan F, Miao W, Raz O, et al. Opsquare: A flat DCN architecture based on flow-controlled optical packet switches[J]. IEEE/OSA Journal of Optical Communications and Networking, 2017, 9(4): 291-303.

[5] Kitayama K I, Huang Y C, Yoshida Y, et al. Torus-topology data center network based on optical packet/agile circuit switching with intelligent flow management[J]. Journal of Lightwave Technology, 2015, 33(5): 1063-1071.

[6] https://math.stackexchange.com/questions/47175/does-birkhoff-von-neumann-imply-any-of-the-fundamental-theorems-in-combinatori

付録┍1┙

这里稍微严格一点写一下先前提到的 Birkhoff-von Neumann 定理，首先是文字表述，也即双随机矩阵都包含于置换矩阵的凸包 (convex hull)，反之亦然，那么现在记一个双随机矩阵 $R=[r_{ij}]$，则定理表述为

\[\begin{equation} \begin{split} R\in\bigcup_{K=1}^{n!}\bigg\{\sum_{k=1}^K\phi_kP_k:\sum_{k=1}^K\phi_k=1,\alpha_k\in\R_+,P_k\in S_n\bigg\}\in{\rm co}(S_n), \end{split} \end{equation} \]

不失普遍性，可以令 $M\leq n^2-n+1$，其证明过程用到 Frobenius‐König 定理 [6] ，思路如下

对于 $\boxplus(R)=|\{(i,j):r_{ij}\neq0\}|$ 进行归纳，$\boxplus(R)=n$ 对应着 $R$ 置换矩阵的情形，此时若假设以上命题对 $\boxplus(\cdot)=N$ 成立，考虑 $\boxplus(R)=N>n$，下列条件满足

\[\begin{equation} \begin{split} \forall1\leq i\leq n,\mathcal{I}\in\bigg(\begin{matrix}[n]\\r\end{matrix}\bigg),\mathcal{J}\in\bigg(\begin{matrix}[n]\\s\end{matrix}\bigg),{\ \text{s.t.}\ }r+s=n+1\to R[\mathcal{I},\mathcal{J}]\neq\mathbf{0}_{r\times s}, \end{split} \end{equation} \]

则由 Frobenius‐König 定理，总能取置换 $\tau\in S_n{\ \text{s.t.}\ }R[i,\tau(i)]\neq0$，这里记 $\phi=\min_{1\leq i\leq n}R[i,\tau(i)]\in(0,1)$，则有分解 $R=\phi\tau+(1-\phi)R'$，其中的 $R'=(1-\phi)^{-1}(R-\alpha\tau)$ 仍为双随机矩阵，又有 $\boxplus(R')<\boxplus(R)=N$，由归纳假设得证 $\blacksquare$

Switch Type	Switching Time	Scale	Loss
Mach-Zehnder interferometer (MZI, thermo-optic)	\(\sim\it10\rm\ \mu s\)	\(\it32\times32\)	High
micro-ring resonator (MRR, thermo-optic)	\(\sim\it10\rm\ \mu s\)	\(\it8\times8\)	Fair
PLZT MZI	\(\sim\it10\rm\ ns\)	\(\it8\times8\)	High
InP MZI	\(\sim\it2.5\rm\ ns\)	\(\it8\times8\)	High
LiNbO₃ MZI	\(\sim\it1\rm\ ns\)	\(\it32\times32\)	High
MZI + electro-absorption modulator (EAM)	\(<\it10\rm\ ns\)	\(\it8\times8\)	Fair
tunable wavelength converters (TWCs) + arrayed waveguide grating router (AWGR)	\(\sim\it10s\rm\ ns\)	\(\it10s\times10s\)	Fair
tunable lasers (TLs) + AWGR	\(\sim\it10s\rm\ ns\)	\(\it100s\times100s\)	Low
semiconductor optical amplifier (SOA) multistage	\(<\it10\rm\ ns\)	\(\it16\times16\)	Fair

Characteristics	MEMS based	Beam-Steering	Liquid crystal	Electro-optic	SOA based	Thermo-optic
Switching speed	\(\it10\sim20\rm\ ms\)	\(\it25\rm\ ms\)	\(\it100\rm\ ms\)	\(\mathbf{\sim\ ns}\)	\(\mathbf{\sim\ ns}\)	\(\it5\sim10\rm\ ms\)
Insertion loss	Medium	Low	High	High	Low	Low
Power efficiency	Medium	Low	High	High	Low	Low
Scalability	Large (\(\it320\times320\))	Large (\(\it384\times384\))	Medium (\(\it1\times20\))	Medium (\(\it32\times32\))	Medium (\(\it16\times16\))	Small (\(\it8\times8\))
Reliability	Low	Low	High	High	Medium	High
Implementation cost	Medium	Medium	Low	High	High	Low

码农公寓

Birkhoff-von Neumann Crossbar 光交换网络的调度方案

Network Architecture and Optical Switching Technology

Demands of Optical Data Center Networks

Optical Switches

Challenges of Optical DCN Solution

Switching Architectures and Scheduling Algorithms

Basic Knowledge on Packet Switching

Performance Analysis

Input-Buffered Switches

Output-Buffered Switches

Completely Shared-Buffered Switches

Scheduling Algorithms

Crossbar Switches as Example

付録┍1┙

相关文章