Common Test Conditions for Neural Network-Based Video Coding

2024-01-27 14:05:16

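At its 20th meeting, JVET established EE1 for neural network-based coding, an exploration experiment dedicated to studying the potential of deep learning in video coding. To standardize and unify test conditions, JVET defined a dedicated set of Common Test Conditions (CTC); the latest version is JVET-X2016. The CTC specifies the configurations, test sequences, training sequences, reference software, training method, and evaluation metrics. Every neural network-based proposal must be tested according to the CTC before it can be submitted.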

Configurations

The CTC defines four configurations, covering intra-only, random-access, and low-delay coding:

· Intra, 10 bit

· Random access, 10 bit

· Low delay, 10 bit

· Low delay, P slices only, 10 bit

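An experiment may test only a subset of these configurations, but it must include at least the RA or the LD configuration.

Test sequences

The test sequences are listed in Table 1. For the RA and LD configurations, all frames of every sequence must be tested; for the intra configuration, only the first 8 frames need to be tested.

Table 1. Test sequences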

Class	Sequence name	Frame count	Resolution	Frame rate	Bit depth	Intra	Random access	Low-delay
A1	Tango2	294*	3840x2160	60	10	M	M	O*
A1	FoodMarket4	300*	3840x2160	60	10	M	M	O*
A1	Campfire	300*	3840x2160	30	10	M	M	O*
A2	CatRobot	300*	3840x2160	60	10	M	M	O*
A2	DaylightRoad2	300*	3840x2160	60	10	M	M	O*
A2	ParkRunning3	300*	3840x2160	50	10	M	M	O*
B	MarketPlace	600	1920x1080	60	10	M	M	M
B	RitualDance	600	1920x1080	60	10	M	M	M
B	Cactus	500	1920x1080	50	8	M	M	M
B	BasketballDrive	500	1920x1080	50	8	M	M	M
B	BQTerrace	600	1920x1080	60	8	M	M	M
C	RaceHorses	300	832x480	30	8	M	M	M
C	BQMall	600	832x480	60	8	M	M	M
C	PartyScene	500	832x480	50	8	M	M	M
C	BasketballDrill	500	832x480	50	8	M	M	M
D	RaceHorses	300	416x240	30	8	M	M	M
D	BQSquare	600	416x240	60	8	M	M	M
D	BlowingBubbles	500	416x240	50	8	M	M	M
D	BasketballPass	500	416x240	50	8	M	M	M
E	FourPeople	600	1280x720	60	8	M	-	M
E	Johnny	600	1280x720	60	8	M	-	M
E	KristenAndSara	600	1280x720	60	8	M	-	M
F	ArenaOfValor	600	1920x1080	60	8	M	M	M
F	BasketballDrillText	500	832x480	50	8	M	M	M
F	SlideEditing	300	1280x720	30	8	M	M	M
F	SlideShow	500	1280x720	20	8	M	M	M
H2	DayStreet2	300	3840x2160	60	10	O	O	-
H2	FlyingBirds3	300	3840x2160	60	10	O	O	-
H2	PeopleInShoppingCenter2	300	3840x2160	60	10	O	O	-
H2	SunsetBeach3	300	3840x2160	60	10	O	O	-

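Notes on the table:

Entries marked “*”: when class A1 and A2 sequences are encoded under the LD configuration, the number of frames to encode should be three times the frame rate.
“M” means the sequence is mandatory for that configuration.
“O” means the sequence is optional for that configuration.
“-” means the sequence does not need to be tested for that configuration.
Download locations for classes A1, A2, B, C, D, E, and F:

ftp://jvet@ftp.hhi.fraunhofer.de/ctc/sdr or

ftp://jvet@ftp.ient.rwth-aachen.de/ctc/sdr
Download location for class H2: ftp://jvet@ftp.ient.rwth-aachen.de/ctc/hdr

(Note: the test sequences are available to members only.)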

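Reference software

The reference software for neural network-based video coding is VTM NNVC. The latest version is VTM11.0_nnvc-1.0, available at https://vcgit.hhi.fraunhofer.de/jvet-ahg-nnvc/VVCSoftware_VTM/-/tree/VTM-11.0_nnvc

Parameter settings

Some basic encoder parameters are listed below:

InputFile: path of the input sequence.
FrameRate: frame rate used for encoding.
SourceWidth: width of the sequence.
SourceHeight: height of the sequence.
FramesToBeEncoded: number of frames to encode.
IntraPeriod: intra refresh interval under the RA configuration. The value depends on the frame rate and the GOP size: use 32 for 20, 24, 25, and 30 fps, 64 for 50 and 60 fps, and 96 for 100 fps.
QP: quantization parameter.
InputBitDepth: bit depth of the input.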
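The IntraPeriod rule above is a simple lookup by frame rate. A minimal Python sketch follows; the function and table names are illustrative and not part of VTM NNVC, and frame rates not listed in the rule are rejected rather than guessed.

```python
# Frame rate (fps) -> IntraPeriod for the RA configuration, as listed above.
INTRA_PERIOD_BY_FRAME_RATE = {
    20: 32, 24: 32, 25: 32, 30: 32,
    50: 64, 60: 64,
    100: 96,
}


def intra_period_for(frame_rate: int) -> int:
    """Return the IntraPeriod for a given frame rate.

    Hypothetical helper: raises ValueError for frame rates the rule
    above does not cover instead of inventing a value.
    """
    try:
        return INTRA_PERIOD_BY_FRAME_RATE[frame_rate]
    except KeyError:
        raise ValueError(f"no IntraPeriod rule for {frame_rate} fps") from None


if __name__ == "__main__":
    for fps in (30, 50, 60):
        print(fps, "fps -> IntraPeriod", intra_period_for(fps))
```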

VTM NNVC provides the following configuration files:

· “All Intra” (AI): encoder_intra_vtm.cfg

· “Random access” (RA): encoder_randomaccess_vtm.cfg

· “Low-delay B” (LB): encoder_lowdelay_vtm.cfg

· “Low-delay P” (LP, optional): encoder_lowdelay_P_vtm.cfg
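As a rough illustration of how the parameters and configuration files fit together, the Python sketch below assembles one encoder command line per run. The executable name `EncoderApp`, the `cfg` directory, and the example sequence file are assumptions about a local build, not something the CTC prescribes; the parameter names are the ones listed in this article, passed in the `--Name=value` form.

```python
# Assumptions: executable name and cfg directory depend on the local build.
ENCODER = "EncoderApp"
CFG_DIR = "cfg"

CONFIG_FILES = {
    "AI": "encoder_intra_vtm.cfg",
    "RA": "encoder_randomaccess_vtm.cfg",
    "LB": "encoder_lowdelay_vtm.cfg",
    "LP": "encoder_lowdelay_P_vtm.cfg",
}


def build_encoder_command(config, input_file, width, height,
                          frame_rate, frames, qp, input_bit_depth):
    """Return the argument list for one encoding run (hypothetical helper)."""
    return [
        ENCODER,
        "-c", f"{CFG_DIR}/{CONFIG_FILES[config]}",
        f"--InputFile={input_file}",
        f"--SourceWidth={width}",
        f"--SourceHeight={height}",
        f"--FrameRate={frame_rate}",
        f"--FramesToBeEncoded={frames}",
        f"--InputBitDepth={input_bit_depth}",
        f"--QP={qp}",
    ]


if __name__ == "__main__":
    # BQMall (class C): 832x480, 60 fps, 600 frames, 8-bit input.
    cmd = build_encoder_command("RA", "BQMall_832x480_60.yuv",
                                832, 480, 60, 600, 32, 8)
    print(" ".join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting.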

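Parallel encoding/decoding

For the RA configuration, the parallel encoding and decoding technique described in JVET-B0036 may be used. The test report should state whether parallel processing was used.

When temporal pre-filtering is used (the TemporalFilter option is enabled), the encoder needs two extra frames before the first frame to be coded and two extra frames after the last one; these additional frames must be made available to the encoder.

When parallel encoding is used, PSNR and MS-SSIM can be computed in either of two ways:

Use the parcat tool provided with VTM NNVC to concatenate the bitstream segments, then compute PSNR and MS-SSIM between the complete decoded YUV sequence and the input sequence.
Compute the PSNR of each frame, then average over the whole sequence. The per-frame PSNR must be computed with 64-bit precision. The command-line option -PrintHexPSNR prints the PSNR values in hexadecimal.

The hexadecimal values can easily be converted to floating-point numbers, for example in C, Python, or Ruby.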
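The conversion can be sketched in Python as follows. This assumes each printed value is the bit pattern of an IEEE 754 64-bit double written as up to 16 hexadecimal digits; the function names and the sample values are illustrative and are not taken from real encoder output.

```python
import math
import struct


def hex_to_double(hex_str: str) -> float:
    """Interpret hex digits as the bit pattern of an IEEE 754 double."""
    digits = hex_str.strip()
    if digits.lower().startswith("0x"):
        digits = digits[2:]
    if len(digits) > 16:
        raise ValueError("expected at most 16 hex digits")
    digits = digits.rjust(16, "0")          # restore dropped leading zeros
    return struct.unpack(">d", bytes.fromhex(digits))[0]


def double_to_hex(value: float) -> str:
    """Inverse of hex_to_double, useful for checking the round trip."""
    return struct.pack(">d", value).hex()


def average_psnr(hex_values):
    """Average per-frame PSNR values given in hexadecimal form."""
    values = [hex_to_double(h) for h in hex_values]
    return math.fsum(values) / len(values)


if __name__ == "__main__":
    # Made-up per-frame values: the bit patterns of 42.0 and 40.0.
    frames = ["4045000000000000", "4044000000000000"]
    print(average_psnr(frames))
```

Working from the hexadecimal form avoids the rounding that a decimal printout with a few fractional digits would introduce before averaging.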

MS-SSIM is handled in the same way, using the command-line options -PrintMSSSIM and -PrintHexMSSSIM.

Compile-time configuration

Compile-time options are set in TypeDef.h under the source/Lib/CommonLib directory.

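Test and inference settings

Temporal prediction

The configurations are further constrained as follows:

All Intra: no prediction between frames.
Random Access: delay of at most 64 frames and a random access interval of at most 1.6 seconds.
Low Delay: no reordering between decoding order and output order.
Low Delay P: no reordering between decoding order and output order, and no bi-directional prediction.

Bit depth

Input and output must be 10-bit. If the input sequence is 8-bit, it must be converted to 10-bit.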
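The article does not say how the 8-bit to 10-bit conversion is performed. A common approach, assumed here, is a left shift by two bits, so that 255 maps to 1020. The sketch below uses a hypothetical helper name and plain Python lists for clarity.

```python
def to_10bit(samples_8bit):
    """Scale 8-bit samples to 10-bit by a left shift of 2 bits.

    Assumption: plain bit shifting (value * 4), no rounding or dithering.
    """
    shift = 10 - 8
    converted = []
    for sample in samples_8bit:
        if not 0 <= sample <= 255:
            raise ValueError(f"not an 8-bit sample: {sample}")
        converted.append(sample << shift)
    return converted


if __name__ == "__main__":
    print(to_10bit([0, 1, 128, 255]))
```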

QP

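QP can be configured in two ways:

Fixed QP: five QPs must be tested: 22, 27, 32, 37, and 42. These are initial QPs; if AQ is used, the QP of each frame may change.
Target bitrate: some applications, such as super-resolution or end-to-end coding, have no notion of quantization. In that case, five coding points corresponding to QP = {22, 27, 32, 37, 42} must be provided, each with a bitrate error within 10%.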
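The target-bitrate rule can be checked mechanically. The sketch below assumes the 10% error is measured relative to the anchor bitrate at the corresponding QP, which the article does not state explicitly; the function name and the bitrates in the example are made up for illustration.

```python
def rate_points_match(anchor_kbps, proposal_kbps, tolerance=0.10):
    """Check each proposal rate point against the anchor rate at the same QP.

    Both arguments map QP -> bitrate in kbps. Returns a dict QP -> bool.
    Assumption: deviation is relative to the anchor bitrate.
    """
    if set(anchor_kbps) != set(proposal_kbps):
        raise ValueError("anchor and proposal must cover the same QPs")
    result = {}
    for qp in sorted(anchor_kbps):
        anchor = anchor_kbps[qp]
        deviation = abs(proposal_kbps[qp] - anchor) / anchor
        result[qp] = deviation <= tolerance
    return result


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    anchor = {22: 1000.0, 27: 500.0}
    proposal = {22: 1080.0, 27: 600.0}
    print(rate_points_match(anchor, proposal))
```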

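Training conditions

The training sequences are available at https://vcgit.hhi.fraunhofer.de/jvet-ahg-nnvc/nnvc-ctc.

If only part of the sequences is used for training, the proposal must state which ones. If the data set is split into a training set and a validation set, the split strategy must be described. Any additional sequences used for training must also be reported.

Metrics

A proposal should report bitrate, PSNR, encoding and decoding time, and BD-rate results. Including MS-SSIM is strongly recommended. Some of these metrics are computed as follows:

PSNR:

In the PSNR computation, bitDepth is the bit depth and must be 10; 8-bit sequences need to be converted to 10-bit.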
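A minimal Python sketch of a PSNR computation that depends on bitDepth is given below. It assumes the peak value is 255 << (bitDepth - 8), that is 1020 for 10-bit content, rather than 2^bitDepth - 1; this convention is an assumption here and should be checked against the CTC document. Samples are plain lists of integers for one colour component.

```python
import math


def psnr(reference, reconstructed, bit_depth=10):
    """PSNR in dB between two equally sized lists of integer samples.

    Assumption: peak = 255 << (bit_depth - 8), i.e. 1020 for 10-bit.
    """
    if len(reference) != len(reconstructed) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    peak = 255 << (bit_depth - 8)
    sse = sum((r - d) ** 2 for r, d in zip(reference, reconstructed))
    if sse == 0:
        return math.inf  # identical signals; reported as infinity here
    mse = sse / len(reference)
    return 10.0 * math.log10(peak * peak / mse)


if __name__ == "__main__":
    ref = [512, 512, 512, 512]
    rec = [512, 513, 511, 512]
    print(round(psnr(ref, rec), 2), "dB")
```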

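MS-SSIM:

MS-SSIM can be computed with the command-line option -PrintMSSSIM or with the open-source tool HDRTools (https://gitlab.com/standards/HDRTools). When HDRTools is used to compute PSNR and MS-SSIM, HDRMetricsYUV.cfg is the configuration file; it is included in HDRTools.

In the MS-SSIM block diagram, L denotes low-pass filtering and 2↓ denotes downsampling by a factor of 2.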
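For reference, the widely used MS-SSIM definition by Wang, Simoncelli, and Bovik is reproduced below. It is assumed, not confirmed by this article, that the CTC uses this form. Here M is the number of scales, l, c, and s are the luminance, contrast, and structure terms, scale j is obtained by applying low-pass filtering and downsampling by 2 a total of j - 1 times, and the exponents α_M, β_j, γ_j are the per-scale weights.

```latex
\mathrm{MS\text{-}SSIM}(x, y) =
  \left[ l_M(x, y) \right]^{\alpha_M}
  \prod_{j=1}^{M} \left[ c_j(x, y) \right]^{\beta_j} \left[ s_j(x, y) \right]^{\gamma_j}

l(x, y) = \frac{2 \mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \qquad
c(x, y) = \frac{2 \sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \qquad
s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}
```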

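Network information

A proposal must also describe the neural network, including its structure, implementation, and training.

Computing platform

The following information must be provided about the environment used for training and for network inference:

· GPU Type: (e.g. GPU: GTX 1080ti x 4 x 12GB, etc.)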

· CPU Type: (e.g. Xeon …)

· Framework: Neural network framework in the inference stage. (e.g. Caffe, TensorFlow, PyTorch, etc. and version)

Training information

A proposal must provide the following information about the training process:

· Epoch: Number of complete passes through the training data (e.g. 100)

· Batch size: Number of samples processed before the model is updated. (e.g. 4Kx16frames)

· Training time: CPU and/or GPU (e.g. 48h)

· Learning curve: Plot of the training loss and validation loss (or similar) versus the number of epochs

· Loss function: Function used to calculate the model error during training and optimization (e.g. L1, L2, etc.)

· Training sequences: Sequences used for training

· Training configuration per rate-distortion point: Any changes in the requested information used to generate different rate-distortion points

Additional training information could also help to better understand the proposed neural network-based method and is thus encouraged to be included in the contribution.

· Number of Iterations: Number of gradient updates within an epoch

· Patch size: Size of input to the neural networks (patchW×patchH×patchT, e.g. 64x64x3)

· Learning rate: Amount that the weights are updated during training (e.g. 5e-4)

· Mini-batch Selection Procedure: Description of mini-batch selection procedure

· Optimizer: Algorithm used to change the attributes of proposed neural networks (e.g. ADAM)

· Preprocessing: (e.g. preprocessing procedure, normalization, cropping method, rotation, zoom etc.)

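Inference information

In the inference stage of many applications, complexity is critical. In addition to encoding and decoding time, the following information about inference must be provided: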

· Network Visualization: Graphical representation of the neural network.

· Param. Number (Each): Number of parameters for each model in the neural network.

· Param. Number (All): Total number of parameters for all models in the neural network.

· Param. Precision: Bits for storing one parameter. Use “I” to indicate an integer and “F” to indicate a floating-point number. For example, if the proposed method uses a 16-bit integer to represent a parameter, report this information as “16 (I)”.

· kMAC: Number of multiply–accumulate (MAC) operations per 1,000 samples in the worst case for the inference stage. Note: A sample corresponds to one luma value or one chroma value. Additionally, please note that an example script for calculating the number of multiply-accumulate (kMAC) operations is available at https://vcgit.hhi.fraunhofer.de/jvet-ahg-nnvc/nnvc-ctc
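To make the parameter-count and kMAC entries concrete, here is a rough Python sketch for a network built from plain convolution layers. The class and function names are hypothetical, the layer sizes are made up, and the scaling (thousands of MACs per sample, with 4:2:0 content counted as 1.5 samples per luma position) is an assumption; the example script in the nnvc-ctc repository is the authoritative reference.

```python
from dataclasses import dataclass


@dataclass
class ConvLayer:
    kernel_h: int
    kernel_w: int
    in_channels: int
    out_channels: int
    out_h: int          # height of the output feature map
    out_w: int          # width of the output feature map
    groups: int = 1
    bias: bool = True

    def macs(self) -> int:
        """Multiply-accumulate operations for one forward pass."""
        per_output = self.kernel_h * self.kernel_w * (self.in_channels // self.groups)
        return per_output * self.out_channels * self.out_h * self.out_w

    def params(self) -> int:
        """Weights plus optional biases."""
        weights = (self.kernel_h * self.kernel_w
                   * (self.in_channels // self.groups) * self.out_channels)
        return weights + (self.out_channels if self.bias else 0)


def samples_420(width: int, height: int) -> int:
    """Luma plus chroma samples of a 4:2:0 picture."""
    return width * height * 3 // 2


def kmac_per_sample(layers, num_samples: int) -> float:
    """Assumed scaling: total MACs / samples / 1000."""
    return sum(layer.macs() for layer in layers) / num_samples / 1000.0


if __name__ == "__main__":
    # Made-up two-layer network applied to a 64x64 4:2:0 block.
    net = [ConvLayer(3, 3, 3, 32, 64, 64), ConvLayer(3, 3, 32, 3, 64, 64)]
    print("params:", sum(layer.params() for layer in net))
    print("kMAC/sample:", kmac_per_sample(net, samples_420(64, 64)))
```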

The following information is not mandatory, but proposals are encouraged to provide it:

· Total Conv. Layers: Total number of convolutional layers in the network structure. If there is no convolutional layer, just fill in 0.

· Total FC Layers: Total number of fully-connected layers in the network structure. If there is no fully-connected layer, just fill in 0.

· Mem.T (MB): Temporary memory used for inference. The temporary memory shall be reported as (i) the total memory size required for inference and (ii) the maximum memory size per model, if the proposal employs multiple network models in its design. The memory size shall be calculated for an input image size of 3840x2160, if there is no parallel operation. Or, if block level parallel operation is used in the proposed method, the block size can be used as the input for calculation, while the input size should be reported.

· Batch size: The number of samples processed in parallel during inference. (e.g. 4Kx16frames)

· Patch size: Size of input to the neural networks during inference (patchW×patchH×patchT, e.g. 64x64x3)

· Border handling: Description of boundary handling, if inference operates on a block (or CTU) basis.

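Encoding and decoding time

A proposal must also report encoding and decoding time, including the time spent on the CPU and the time spent on the GPU. The environment in which the times were measured must be described as well, including patch size, batch size, number of GPUs, and number of CPU cores. The anchor and the proposed algorithm should be timed in the same environment.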
