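I am developing a native Android application that should run on devices with an ARMv7 processor.
For various reasons, I need to do some heavy computation on vectors (short and/or float).
I implemented some assembly functions using NEON instructions to speed up the computation. I got a speedup factor of 1.5, which is not bad. I wonder whether these functions can be made any faster.
So the question is: what changes can I make to improve these functions?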
// Add two float vectors.
// The result could be put in src1 instead of dst.
void add_float_vector_with_neon3(float* dst, float* src1, float* src2, int count)
{
asm volatile (
"1: \n"
"vld1.32 {q0}, [%[src1]]! \n"
"vld1.32 {q1}, [%[src2]]! \n"
"vadd.f32 q0, q0, q1 \n"
"subs %[count], %[count], #4 \n"
"vst1.32 {q0}, [%[dst]]! \n"
"bgt 1b \n"
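// src1, src2 and count are modified by the asm, so they must be read-write operands;
// subs changes the condition flags, hence the "cc" clobber.
: [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
:
: "memory", "cc", "q0", "q1"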
);
}
// Multiply a float vector by a scalar.
// The result could be put in src1 instead of dst.
void mul_float_vector_by_scalar_with_neon3(float* dst, float* src1, float scalar, int count)
{
asm volatile (
"vdup.32 q1, %[scalar] \n"
"2: \n"
"vld1.32 {q0}, [%[src1]]! \n"
"vmul.f32 q0, q0, q1 \n"
"subs %[count], %[count], #4 \n"
"vst1.32 {q0}, [%[dst]]! \n"
"bgt 2b \n"
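// src1 and count are modified by the asm, so they must be read-write operands;
// subs changes the condition flags, hence the "cc" clobber.
: [dst] "+r" (dst), [src1] "+r" (src1), [count] "+r" (count)
: [scalar] "r" (scalar)
: "memory", "cc", "q0", "q1"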
);
}
// Add two short vectors -> no problem with value-range limits.
// The result should be put in a dest different from src1 and src2.
void add_short_vector_with_neon3(short* dst, short* src1, short* src2, int count)
{
asm volatile (
"3: \n"
"vld1.16 {q0}, [%[src1]]! \n"
"vld1.16 {q1}, [%[src2]]! \n"
"vadd.i16 q0, q0, q1 \n"
"subs %[count], %[count], #8 \n"
"vst1.16 {q0}, [%[dst]]! \n"
"bgt 3b \n"
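// src1, src2 and count are modified by the asm, so they must be read-write operands;
// subs changes the condition flags, hence the "cc" clobber.
: [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
:
: "memory", "cc", "q0", "q1"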
);
}
// Multiply a short vector by a float vector and put the result back into a short vector.
// The result should be put in a dest different from src1.
void mul_short_vector_by_float_vector_with_neon3(short* dst, short* src1, float* src2, int count)
{
asm volatile (
"4: \n"
"vld1.16 {d0}, [%[src1]]! \n"
"vld1.32 {q1}, [%[src2]]! \n"
"vmovl.s16 q0, d0 \n"
"vcvt.f32.s32 q0, q0 \n"
"vmul.f32 q0, q0, q1 \n"
"vcvt.s32.f32 q0, q0 \n"
"vmovn.s32 d0, q0 \n"
"subs %[count], %[count], #4 \n"
"vst1.16 {d0}, [%[dst]]! \n"
"bgt 4b \n"
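// src1, src2 and count are modified by the asm, so they must be read-write operands;
// subs changes the condition flags, hence the "cc" clobber.
: [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
:
: "memory", "cc", "d0", "q0", "q1"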
);
}
Thanks in advance!
Answer:
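You can try unrolling the loop so that each iteration processes more elements.
Your add_float_vector_with_neon3 code takes 10 cycles per 4 elements (because of stalls), whereas unrolling to 16 elements takes 21 cycles. That is roughly 1.3 cycles per element instead of 2.5.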
http://pulsar.webshaker.net/ccc/sample-34e5f701
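There is some overhead, because you need to handle the remainder (or you can pad the data to a multiple of 16), but if you have a lot of data, that overhead should be fairly low compared with the actual sum.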
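To make the unrolling and remainder handling concrete, here is a minimal sketch. Note that it swaps in NEON intrinsics from <arm_neon.h> rather than hand-written assembly, and that the function name add_float_vector_unrolled16 is hypothetical, not from the question. On targets without NEON it falls back to the plain scalar loop, so it only illustrates the structure (16 elements per iteration plus a scalar tail), not the cycle counts quoted above.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper (name is an assumption, not from the question).
 * Computes dst[i] = src1[i] + src2[i] for any count >= 0. */
void add_float_vector_unrolled16(float *dst, const float *src1,
                                 const float *src2, int count)
{
    assert(dst != NULL && src1 != NULL && src2 != NULL && count >= 0);

    int i = 0;

#if HAVE_NEON
    /* Main loop: 16 floats per iteration, i.e. four q registers per source.
     * The loads are written before the adds so that independent work is
     * available while earlier loads complete. */
    for (; i + 16 <= count; i += 16) {
        float32x4_t a0 = vld1q_f32(src1 + i);
        float32x4_t a1 = vld1q_f32(src1 + i + 4);
        float32x4_t a2 = vld1q_f32(src1 + i + 8);
        float32x4_t a3 = vld1q_f32(src1 + i + 12);
        float32x4_t b0 = vld1q_f32(src2 + i);
        float32x4_t b1 = vld1q_f32(src2 + i + 4);
        float32x4_t b2 = vld1q_f32(src2 + i + 8);
        float32x4_t b3 = vld1q_f32(src2 + i + 12);

        vst1q_f32(dst + i,      vaddq_f32(a0, b0));
        vst1q_f32(dst + i + 4,  vaddq_f32(a1, b1));
        vst1q_f32(dst + i + 8,  vaddq_f32(a2, b2));
        vst1q_f32(dst + i + 12, vaddq_f32(a3, b3));
    }
#endif

    /* Remainder: at most 15 elements when NEON is available,
     * or the whole array on targets without NEON. */
    for (; i < count; ++i) {
        dst[i] = src1[i] + src2[i];
    }
}
```

With intrinsics the compiler chooses the final instruction order and register allocation, so the generated code may differ from a hand-scheduled assembly loop; it is worth checking the disassembly and measuring on the target device before concluding which version is faster.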