Auto-vectorization case: shift

The code looks roughly like this:

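#include <cstdint>

// Rounds v up to the next power of two: smear the highest set bit of
// (v - 1) into every lower bit, then add 1.
inline int64_t RoundUpToPowerOfTwo(int64_t v) {
    --v;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v |= v >> 32;
    ++v;
    return v;
}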

void foo(int64_t* src, int64_t* dst, int len) {
    for (int i = 0; i < len; i++) {
        dst[i] = RoundUpToPowerOfTwo(src[i]);
    }
}
Compile flags:
$ g++ -O3 -g -std=c++11 -mavx2 -fopt-info-vec-optimized ans.cpp
No output, i.e. GCC does not report any vectorized loop.
$ objdump -d ./a.out |less 
...
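The disassembly contains no vector instructions for the loop, even though RoundUpToPowerOfTwo was indeed inlined and there is no function call in the loop body.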

Adding the __restrict qualifier to the pointer parameters does not help either.
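
The original post does not show the __restrict variant, so the following is only a sketch of what that attempt presumably looked like. The name foo_restrict is hypothetical; __restrict is a compiler extension (accepted by GCC and Clang) that promises the two pointers never overlap.

```cpp
#include <cstdint>

inline int64_t RoundUpToPowerOfTwo(int64_t v) {
    --v;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v |= v >> 32;
    ++v;
    return v;
}

// Hypothetical variant of foo: __restrict tells the compiler that src and
// dst do not alias. It rules out aliasing as the reason for the missing
// vectorization, but it does not change the shifts inside the loop.
void foo_restrict(int64_t* __restrict src, int64_t* __restrict dst, int len) {
    for (int i = 0; i < len; i++) {
        dst[i] = RoundUpToPowerOfTwo(src[i]);
    }
}
```

Since the result is the same with and without the qualifier, aliasing is not what blocks vectorization here.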

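After narrowing it down, the culprit turned out to be the right shift on int64_t: even this minimal loop is not vectorized.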

void foo(int64_t* src, int64_t* dst, int len) {
    for (int i = 0; i < len; i++) {
        dst[i] = src[i] >> 1;
    }
}

Further reading shows that a left shift, by contrast, can be vectorized.
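
A likely explanation, not stated in the original notes: on a signed int64_t, GCC compiles >> as an arithmetic shift that replicates the sign bit, and AVX2 has no 64-bit arithmetic right shift instruction (VPSRAQ only arrives with AVX-512). A logical right shift (VPSRLQ) and a left shift (VPSLLQ) on 64-bit lanes do exist in AVX2. The sketch below only illustrates the semantic difference between the two shifts; ShiftSigned and ShiftUnsigned are hypothetical names.

```cpp
#include <cstdint>

// Signed operand: GCC performs an arithmetic shift, so the sign bit is
// copied into the vacated top bit. (Before C++20 the result for negative
// values is implementation-defined; GCC and Clang behave this way.)
inline int64_t ShiftSigned(int64_t v) { return v >> 1; }

// Unsigned operand: a logical shift, the vacated top bit is always zero.
inline uint64_t ShiftUnsigned(uint64_t v) { return v >> 1; }
```

With GCC, ShiftSigned(-8) yields -4, while ShiftUnsigned applied to the same bit pattern (0xFFFFFFFFFFFFFFF8) yields 0x7FFFFFFFFFFFFFFC. For non-negative inputs the two agree, which is why switching types is safe when the values are known to be non-negative.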

Fix:

// Change the element type to uint64_t
void foo(uint64_t* src, uint64_t* dst, int len) {
    for (int i = 0; i < len; i++) {
        dst[i] = src[i] >> 1;
    }
}

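RoundUpToPowerOfTwo is changed the same way:

inline uint64_t RoundUpToPowerOfTwo(uint64_t v);

With both changes, GCC now reports the loop as vectorized:

ans.cpp:46:23: optimized: loop vectorized using 32 byte vectors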
ans.cpp:46:23: optimized: loop versioned for vectorization because of possible aliasing
ans.cpp:46:23: optimized: loop vectorized using 16 byte vectors
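
Putting the two changes together, the full unsigned version might look like the sketch below. It assumes the inputs are non-negative to begin with, so that switching from int64_t to uint64_t does not change the results; the body of RoundUpToPowerOfTwo is copied from the signed version above with only the type changed.

```cpp
#include <cstdint>

// Unsigned variant: every >> is now a logical shift.
// Note: returns 0 for v == 0 and for v > 2^63 (the result wraps around).
inline uint64_t RoundUpToPowerOfTwo(uint64_t v) {
    --v;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v |= v >> 32;
    ++v;
    return v;
}

void foo(uint64_t* src, uint64_t* dst, int len) {
    for (int i = 0; i < len; i++) {
        dst[i] = RoundUpToPowerOfTwo(src[i]);
    }
}
```

The "loop versioned ... because of possible aliasing" message means GCC emits a runtime overlap check in front of the vector loop; marking src and dst as __restrict should let it drop that check, though this was not verified here.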
