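I already have two years of ML experience, so I am using this course series mainly to fill in gaps; these notes record details and things I did not know.
Someone has already taken notes (very thorough, highly recommended): https://github.com/Sakura-gh/ML-notes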
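Overview of this lecture
- This lecture is given by teaching assistant 黄冠博 and is split into two parts: images and speech.
- One Pixel Attack: changing just a single pixel is enough to carry out an attack. The lecture focuses on Differential Evolution.
- Next comes the Adversarial Attack Outline part.
- The first subsection is Attacks on ASR (Automatic Speech Recognition).
- The second subsection is Attacks on ASV (Automatic Speaker Verification).
- Wake Up Words are mentioned briefly.
- The third subsection, Hidden Voice Attack, is the main focus. For example, one can craft a piece of noise that humans cannot make sense of but that a machine may interpret as a command, such as "Hey, Siri".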
Details
One Pixel Attack
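Shown above is the noise produced by two kinds of attack. Both have the same optimization objective.
The examples above are all One Pixel Attacks.
As shown above, and as in the previous lecture, attacks are divided into untargeted attacks and targeted attacks. What characterizes the One Pixel Attack is that only one $x$ may differ from the original.
Trying every pixel exhaustively would take too long, so we use Differential Evolution. Put differently, we do not need to find the best pixel; any pixel that makes the attack succeed will do.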
Differential Evolution
During each iteration, another set of candidate solutions (children) is generated according to the current population (parents). Then the children are compared with their corresponding parents, surviving if they are fitter (possess a higher fitness value) than their parents. In this way, by comparing only each parent with its child, the goals of keeping diversity and improving fitness values can be achieved simultaneously.
This is somewhat similar to a genetic algorithm with elitism. Its advantages are:
- Higher probability of finding global optima: due to diversity-keeping mechanisms and the use of a set of candidate solutions
- Requires less information from the target system: unlike FGSM, DE does not compute gradients, so it needs few details about the model under attack; it is independent of the classifier used
Overall, the procedure is:
- Initialize Candidates
- Select Candidates and Generate
- Test Candidates and Substitute
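The last two steps are repeated over and over.
DE does not treat a single value in the image as an "individual"; instead, each candidate encodes all the information about one pixel as the attack target: $(x, y, R, G, B)$.
From this one can infer that the larger the image, the lower the attack success rate.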
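The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the lecture's or the paper's implementation: `toy_confidence` is a hypothetical stand-in for a black-box classifier, the population size, iteration count, and scale factor `F` are arbitrary, and the sketch omits crossover. Lower fitness means lower confidence in the true class, so a child survives when its score is lower than its parent's.

```python
import numpy as np

def apply_candidate(image, cand):
    """Return a copy of `image` with one pixel overwritten by cand = (x, y, R, G, B)."""
    h, w, _ = image.shape
    x = int(np.clip(round(cand[0]), 0, w - 1))
    y = int(np.clip(round(cand[1]), 0, h - 1))
    out = image.copy()
    out[y, x] = np.clip(cand[2:5], 0, 255)
    return out

def toy_confidence(image):
    """Hypothetical black-box model: true-class confidence tracks the darkest value."""
    return float(image.min()) / 255.0

def one_pixel_attack(image, fitness, pop_size=20, iters=30, F=0.5, seed=0):
    rng = np.random.default_rng(seed)
    h, w, _ = image.shape
    low = np.array([0, 0, 0, 0, 0], dtype=float)
    high = np.array([w - 1, h - 1, 255, 255, 255], dtype=float)
    # 1. Initialize candidates: each one is a full (x, y, R, G, B) tuple
    pop = rng.uniform(low, high, size=(pop_size, 5))
    scores = np.array([fitness(apply_candidate(image, c)) for c in pop])
    for _ in range(iters):
        for i in range(pop_size):
            # 2. Select three other candidates and generate a child from them
            others = [j for j in range(pop_size) if j != i]
            r1, r2, r3 = rng.choice(others, size=3, replace=False)
            child = np.clip(pop[r1] + F * (pop[r2] - pop[r3]), low, high)
            # 3. Test the child against its own parent and keep the fitter one
            s = fitness(apply_candidate(image, child))
            if s < scores[i]:
                pop[i], scores[i] = child, s
    best = int(np.argmin(scores))
    return pop[best], float(scores[best])

image = np.full((8, 8, 3), 200.0)   # toy 8x8 RGB image (assumed)
best, score = one_pixel_attack(image, toy_confidence)
adv = apply_candidate(image, best)  # differs from `image` in at most one pixel
```

Note that the search only ever calls `fitness`, never a gradient, which is why it needs so little information about the model.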
Attacks on ASR
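For reference: https://nicholas.carlini.com/code/audio_adversarial_examples
As shown above, this is similar to attacking images: a piece of noise is added so that the neural network makes a wrong prediction.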
Attacks on ASV
As shown above, the same holds for classification problems in speech recognition: they can also be attacked by adding noise.
Wake Up Words
Hidden Voice Attack
As shown above, the teaching assistant played a piece of noise that actually encodes "turn on the computer".
Psychoacoustics
Psychoacoustics is the study of how humans perceive and respond to sound.
Signal Preprocessing
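An audio signal goes through the following steps:
- sampling the audio signal;
- preprocessing, which filters out the high- and low-frequency content;
- signal processing computations;
- feeding the result to the model.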
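The pipeline above could look roughly like the sketch below. Everything concrete in it is an assumption made for illustration: the 8 kHz sample rate, the 300 to 3400 Hz pass band, the idealized FFT-mask filter, and the non-overlapping 256-sample Hann-windowed frames. A real voice processing system will differ in these details.

```python
import numpy as np

def preprocess(signal, sample_rate, low_hz=300.0, high_hz=3400.0, frame_len=256):
    """Band-limit the signal, split it into frames, and return magnitude spectra."""
    # Preprocessing: zero out frequency bins outside [low_hz, high_hz]
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    filtered = np.fft.irfft(spectrum, n=len(signal))
    # Signal processing: Hann-windowed frames, then the magnitude FFT of each frame
    n_frames = len(filtered) // frame_len
    frames = filtered[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))

sample_rate = 8000                            # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate      # sampling: one second of audio
signal = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 50 * t)
features = preprocess(signal, sample_rate)    # this array would be the model input
```

Only the magnitude of each frame's spectrum reaches the model here, which is the property the perturbations in the next section rely on.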
Perturbation
As shown in the figure, four ways of carrying out the attack are introduced.
Time Domain Inversion (TDI)
- exploits the many-to-one property of the mFFT (magnitude FFT);
- two completely different signals in the time domain can have similar spectra
- modifying the audio in the time domain while preserving its spectrum, by inverting the windowed signal
- inverting small windows across the entire signal, removes the smoothness
What is the point?
- We have modified the audio;
- yet its mFFT is still the same as that of the original audio.
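A small numerical check of this property, assuming a random real-valued array as a stand-in for one short window of audio: reversing the window in time changes the waveform but leaves the magnitude of its FFT unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
window = rng.standard_normal(256)   # stand-in for one short window of real audio
inverted = window[::-1]             # time domain inversion: reverse the window

mfft_original = np.abs(np.fft.fft(window))
mfft_inverted = np.abs(np.fft.fft(inverted))

waveform_changed = not np.allclose(window, inverted)         # True
mfft_preserved = np.allclose(mfft_original, mfft_inverted)   # True
```

For a real signal, reversal only conjugates each FFT bin and multiplies it by a unit-magnitude phase factor, so the magnitudes are identical.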
Random Phase Generation
As shown in the figure, $a$ and $b$ are adjusted while $\sqrt{a^2 + b^2}$ is kept unchanged.
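One way to picture this, as a sketch rather than the attack's actual procedure: treat each FFT bin as a complex number $a + bi$ and rotate it by a random angle. The random signal and the choice to leave the DC and Nyquist bins untouched (so the result stays a real signal) are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(256)      # stand-in for a window of real audio
spectrum = np.fft.rfft(signal)         # each bin is a complex number a + bi

# Rotating a bin changes a and b but leaves sqrt(a^2 + b^2) unchanged.
theta = rng.uniform(0.0, 2.0 * np.pi, size=spectrum.shape)
theta[0] = theta[-1] = 0.0             # keep DC and Nyquist bins real
new_spectrum = spectrum * np.exp(1j * theta)
new_signal = np.fft.irfft(new_spectrum, n=len(signal))

same_magnitude = np.allclose(np.abs(np.fft.rfft(new_signal)), np.abs(spectrum))
same_waveform = np.allclose(new_signal, signal)
```

Here `same_magnitude` holds while `same_waveform` does not: the time-domain audio is different, but the magnitude spectrum is the same.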
High Frequency Addition (HFA)
- During signal processing, a low-pass filter removes the frequency bands far above the range of the human ear, in order to improve the accuracy of the VPS (Voice Processing System)
- add high frequencies to the audio that are filtered out during the preprocessing stage
- create high frequency sine waves and add them to the real audio
- If the sine waves have enough intensity, it has the potential to mask the underlying audio command to the human ear.
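The idea can be sketched as below. The numbers are assumptions chosen for the example (48 kHz sample rate, an 8 kHz cutoff, a 440 Hz tone standing in for the command, a 12 kHz masking sine), and `low_pass` is an idealized FFT-mask filter rather than the filter any particular system uses.

```python
import numpy as np

def low_pass(signal, sample_rate, cutoff_hz):
    """Idealized low-pass filter: zero every FFT bin above cutoff_hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

sample_rate = 48000                              # assumed recording rate
t = np.arange(sample_rate) / sample_rate         # one second of audio
command = np.sin(2 * np.pi * 440 * t)            # stand-in for the spoken command
mask = 5.0 * np.sin(2 * np.pi * 12000 * t)       # loud high-frequency sine wave
attacked = command + mask                        # what is played to the listener

cutoff_hz = 8000                                 # assumed preprocessing cutoff
seen_by_model = low_pass(attacked, sample_rate, cutoff_hz)
model_sees_command = np.allclose(seen_by_model, command, atol=1e-8)
```

The added sine dominates `attacked`, yet after the preprocessing filter the model receives essentially the original command.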
Time Scaling (TS)
- speed up the audio to the point where the model still recognizes it correctly but humans can hardly make out what is being said
- compress the audio in the time domain by discarding unnecessary samples and maintain the same sample rate
- the audio is shorter in time, but retains the same spectrum as the original
What is the sample rate? A waveform is made up of many points, and the sample rate is the number of data points per second.
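A tiny sketch of how sample count, sample rate, and duration relate. The 16 kHz rate, the 220 Hz test tone, and the choice to discard every other sample are assumptions for illustration; the notes do not say which samples count as unnecessary.

```python
import numpy as np

sample_rate = 16000                                   # data points per second (assumed)
t = np.arange(2 * sample_rate) / sample_rate          # two seconds of sample times
audio = np.sin(2 * np.pi * 220 * t)                   # stand-in for recorded speech
duration = len(audio) / sample_rate                   # 2.0 seconds

compressed = audio[::2]                               # discard every other sample
compressed_duration = len(compressed) / sample_rate   # 1.0 second at the same rate
```

Because the sample rate is unchanged while half of the data points are gone, the audio plays back in half the time.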