前面简单介绍了row hammer攻击的原理和方法,为了更好理解这种底层硬件类攻击,今天介绍一下cpu的cache mapping;
众所周知,cpu从内存读数据,最开始用的是虚拟地址,需要通过分页机制,将虚拟地址转换成物理地址,然后从物理地址(默认是DRAM,俗称内存条)读数据;但内存条速度和cpu相差近百倍,由此诞生了L1\L2\L3 cache;cpu取数据时,会先从各个层级的cache去找,没有的再从内存取;那么问题来了,L3 cache里面有set、slice、line等模块将整个cache划分成一个一个64byte的cache line,cpu是怎么根据物理地址从L3 cache中取数据的了?比如8MB的L3 cache,一共有8MB/64byte = 2,097,152个cache line,cpu怎么根据物理地址精确地找到目标cache line了?
1、直接映射(单路相连)
假如物理地址是0x654,这个地址对应的L3 cache的哪个存储单元了?先看一种最简单的情况:
- 假如有8个cache line,需要3bit遍历,中间标黄的010就是cache line之间的index;
- 假如每个cache line 长度是8byte,同样只需要3bit就能遍历所有bbyte,标蓝的就是cache line内部的offset
- 剩下标绿的11001就是tag;cpu额外有个tag array,通过index取出tag array中的tag,和11001对比,如果是,说明这个byte就是该物理地址对应的存储单元,可以马上取数据了,这叫cache hit;否则称为cache miss;
直接映射有缺陷:如果两个物理地址的index和offset都一样,但tag不同,也会映射到同一个cache line,增加了刷新cache的时间成本。由此产生了改进的方法,
2、两路相连
和1的直连比,仅仅把tag array和cache line组均分成2分,offset和index寻址不变,仅仅是tag对比改变:这里由于分了两组,所以会有2个tag,只要物理地址的tag和其中一个相同,就算cache hit;相当于多了一次tag比对的机会,增加了命中概率;比如物理地址的tag=0x32,和tag array左边那个是一样的,那么cache line就用way0的;
如果继续分组,比如4组,就是4way;8组就是8way了,以此类推(后面我在kali上做实验,查到cache是8way的,也就是说每个物理地址的tag都有8次对比的机会,命中的概率还是蛮大的);
再举例,比如缓存总大小32 KB,由4路组相连cache,cache line大小是32 Bytes,该怎么划分了?
- 总大小32KB,由4路,每路8KB;
- 每个cache line 32byte,那么一共有8KB/32byte=256个,所以index至少8bit;
- 每个cache line 32byte,offset至少5bit;
整个规划架构如下:
3、全连接
所有的cache line都在一个组内,因此地址中不需要index部分;可根据地址中的tag部分和所有的cache line对应的tag进行比较(硬件上可能并行比较也可能串行比较),哪个tag比较相等,就命中某个cache line,所以在全相连缓存中,任意地址的数据可以缓存在任意的cache line;但这么做成本很高;
4、前面介绍3中cache mapping的方法,一旦出现cache miss,cpu会怎么做了?
假设我们有一个64 Bytes大小直接映射缓存,cache line大小是8 Bytes,采用写分配和写回机制。当CPU从地址0x2a读取一个字节,cache中的数据将会如何变化呢?假设当前cache状态如下图所示(tag旁边valid一栏的数字1代表合法。0代表非法。后面Dirty的1代表dirty,0代表没有写过数据,即非dirty);
根据index找到对应的cache line,对应的tag部分valid bit是合法的,但是tag的值不相等,因此发生cache miss。此时我们需要从地址0x28(8字节对齐)地址加载8字节数据到该cache line中(cache line是缓存最小的读写单元);但是,我们发现当前cache line的dirty bit置位(表示),所以cache line里面的数据不能被简单的丢弃;由于采用写回机制,所以我们需要将cache中的数据0x11223344写到地址0x0128地址(tag:0x04 index:101 offset:010,连接起来就是100 101 010=0x12a,考虑到8字节对齐,就从0x128开始);
当写回操作完成,再将主存中0x28地址开始的8个字节加载到该cache line中,并清除dirty bit。然后根据offset找到0x52返回给CPU;
5、 cache mapping测试
https://github.com/google/rowhammer-test/tree/master/cache_analysis 这里有现成的代码,可以直接用;
核心思路:分配虚拟空间->转成物理地址->每隔一页再生成物理地址->这两个地址在同一个cache set吗? -> 如果是就保留->从该保留的地址读10次数据,保留每次耗时->取中位数;
本人实验环境kali下查看cpu的ways_of_associate是8,关联度就是8;
代码中:尝试的地址个数:int max_addr_count = 9 * 4 就可以在8附近(比如5~11)多取几个值对比看看结果;(原作则是12way的,用不同数量地址反复做测试,发现地址数量大于13后耗时明显增加很多,也就是cache missing激增)
// Copyright 2015, Google, Inc. // // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License. // You may obtain a copy of the License at // // http://www.apache.org/licenses/LICENSE-2.0 // // Unless required by applicable law or agreed to in writing, software // distributed under the License is distributed on an "AS IS" BASIS, // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. #include <assert.h> #include <fcntl.h> #include <stdint.h> #include <stdio.h> #include <sys/mman.h> #include <time.h> #include <unistd.h> #include <algorithm> // This program attempts to pick sets of memory locations that map to // the same L3 cache set. It tests whether they really do map to the // same cache set by timing accesses to them and outputting a CSV file // of times that can be graphed. This program assumes a 2-core Sandy // Bridge CPU. // Dummy variable to attempt to prevent compiler and CPU from skipping // memory accesses. int g_dummy; namespace { const int page_size = 0x1000; int g_pagemap_fd = -1; // Extract the physical page number from a Linux /proc/PID/pagemap entry. uint64_t frame_number_from_pagemap(uint64_t value) { return value & ((1ULL << 54) - 1); } void init_pagemap() { g_pagemap_fd = open("/proc/self/pagemap", O_RDONLY); assert(g_pagemap_fd >= 0); } /*虚拟地址转成物理地址*/ uint64_t get_physical_addr(uintptr_t virtual_addr) { uint64_t value; /*virtual_addr=16<<20;page_size=4096,sizeof(value)=8,offset=4096*8*/ off_t offset = (virtual_addr / page_size) * sizeof(value); int got = pread(g_pagemap_fd, &value, sizeof(value), offset); assert(got == 8); // Check the "page present" flag. assert(value & (1ULL << 63)); uint64_t frame_num = frame_number_from_pagemap(value); return (frame_num * page_size) | (virtual_addr & (page_size - 1)); } // Execute a CPU memory barrier. This is an attempt to prevent memory // accesses from being reordered, in case reordering affects what gets // evicted from the cache. It‘s also an attempt to ensure we‘re // measuring the time for a single memory access. // // However, this appears to be unnecessary on Sandy Bridge CPUs, since // we get the same shape graph without this. inline void mfence() { asm volatile("mfence"); } // Measure the time taken to access the given address, in nanoseconds. int time_access(uintptr_t ptr) { struct timespec ts0; int rc = clock_gettime(CLOCK_MONOTONIC, &ts0); assert(rc == 0); g_dummy += *(volatile int *) ptr; mfence(); struct timespec ts; rc = clock_gettime(CLOCK_MONOTONIC, &ts); assert(rc == 0); return (ts.tv_sec - ts0.tv_sec) * 1000000000 + (ts.tv_nsec - ts0.tv_nsec); } // Given a physical memory address, this hashes the address and // returns the number of the cache slice that the address maps to. // // This assumes a 2-core Sandy Bridge CPU. // // "bad_bit" lets us test whether this hash function is correct. It // inverts whether the given bit number is included in the set of // address bits to hash. int get_cache_slice(uint64_t phys_addr, int bad_bit) { // On a 4-core machine, the CPU‘s hash function produces a 2-bit // cache slice number, where the two bits are defined by "h1" and // "h2": // // h1 function: // static const int bits[] = { 18, 19, 21, 23, 25, 27, 29, 30, 31 }; // h2 function: // static const int bits[] = { 17, 19, 20, 21, 22, 23, 24, 26, 28, 29, 31 }; // // This hash function is described in the paper "Practical Timing // Side Channel Attacks Against Kernel Space ASLR". // // On a 2-core machine, the CPU‘s hash function produces a 1-bit // cache slice number which appears to be the XOR of h1 and h2. // XOR of h1 and h2: static const int bits[] = { 17, 18, 20, 22, 24, 25, 26, 27, 28, 30 }; int count = sizeof(bits) / sizeof(bits[0]); int hash = 0; for (int i = 0; i < count; i++) { hash ^= (phys_addr >> bits[i]) & 1; } if (bad_bit != -1) { hash ^= (phys_addr >> bad_bit) & 1; } return hash; } bool in_same_cache_set(uint64_t phys1, uint64_t phys2, int bad_bit) { // For Sandy Bridge, the bottom 17 bits determine the cache set // within the cache slice (or the location within a cache line). uint64_t mask = ((uint64_t) 1 << 17) - 1; return ((phys1 & mask) == (phys2 & mask) && get_cache_slice(phys1, bad_bit) == get_cache_slice(phys2, bad_bit)); } int timing(int addr_count, int bad_bit) { size_t size = 16 << 20; uintptr_t buf = (uintptr_t) mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);//分配内存 assert(buf); uintptr_t addrs[addr_count]; addrs[0] = buf; uintptr_t phys1 = get_physical_addr(addrs[0]); // Pick a set of addresses which we think belong to the same cache set.两个物理地址的跨度是1页; uintptr_t next_addr = buf + page_size; uintptr_t end_addr = buf + size; int found = 1; while (found < addr_count) { assert(next_addr < end_addr); uintptr_t addr = next_addr; next_addr += page_size; uint64_t phys2 = get_physical_addr(addr); if (in_same_cache_set(phys1, phys2, bad_bit)) { addrs[found] = addr; found++; } } // Time memory accesses. int runs = 10; int times[runs];//记录地址的读取耗时 for (int run = 0; run < runs; run++) { // Ensure the first address is cached by accessing it. g_dummy += *(volatile int *) addrs[0]; mfence(); // Now pull the other addresses through the cache too. for (int i = 1; i < addr_count; i++) { g_dummy += *(volatile int *) addrs[i]; } mfence(); // See whether the first address got evicted from the cache by // timing accessing it. times[run] = time_access(addrs[0]); } // Find the median time. We use the median in order to discard // outliers. We want to discard outlying slow results which are // likely to be the result of other activity on the machine. // // We also want to discard outliers where memory was accessed // unusually quickly. These could be the result of the CPU‘s // eviction policy not using an exact LRU policy. std::sort(times, ×[runs]); int median_time = times[runs / 2]; int rc = munmap((void *) buf, size); assert(rc == 0); return median_time; } int timing_mean(int addr_count, int bad_bit) { int runs = 10; int sum_time = 0; for (int i = 0; i < runs; i++) sum_time += timing(addr_count, bad_bit); return sum_time / runs; } } // namespace int main() { init_pagemap(); // Turn off stdout caching. setvbuf(stdout, NULL, _IONBF, 0); // For a 12-way cache, we want to pick 13 addresses belonging to the // same cache set. Measure the effect of picking more addresses to // test whether in_same_cache_set() is correctly determining whether // addresses belong to the same cache set. int max_addr_count = 9 * 4;//cpu是8way的,这用9 bool test_bad_bits = true; printf("Address count"); printf(",Baseline hash (no bits changed)"); if (test_bad_bits) { for (int bad_bit = 17; bad_bit < 32; bad_bit++) { printf(",Change bit %i", bad_bit); } } printf("\n"); for (int addr_count = 0; addr_count < max_addr_count; addr_count++) { printf("%i", addr_count); printf(",%i", timing_mean(addr_count, -1)); if (test_bad_bits) { for (int bad_bit = 17; bad_bit < 32; bad_bit++) { printf(",%i", timing_mean(addr_count, bad_bit)); } } printf("\n"); } return 0; }
参考:http://lackingrhoticity.blogspot.com/2015/04/l3-cache-mapping-on-sandy-bridge-cpus.html L3 cache mapping on Sandy Bridge CPUs
https://zhuanlan.zhihu.com/p/102293437 Cache的基本原理
最后整理了一个脑图,方便串联理解各个要点: