A 570 though would be significantly faster (but also run significantly hotter). I am still trying to understand why the Kepler architecture has such a performance disadvantage with my current code.
I did try some inline PTX assembly (looks horrid, check it out)
__device__ void ROTL7(uint32_t &A0, const uint32_t &A1, const uint32_t &A2,
uint32_t &B0, const uint32_t &B1, const uint32_t &B2,
uint32_t &C0, const uint32_t &C1, const uint32_t &C2,
uint32_t &D0, const uint32_t &D1, const uint32_t &D2)
{
asm("{\n\t"
" .reg .u32 tA1, tA2;\n\t"
" .reg .u32 tB1, tB2;\n\t"
" .reg .u32 tC1, tC2;\n\t"
" .reg .u32 tD1, tD2;\n\t"
" add.u32 tA1, %4, %5;\n\t"
" add.u32 tB1, %6, %7;\n\t"
" add.u32 tC1, %8, %9;\n\t"
" add.u32 tD1, %10, %11;\n\t"
" shl.b32 tA2, tA1, 7;\n\t"
" shl.b32 tB2, tB1, 7;\n\t"
" shl.b32 tC2, tC1, 7;\n\t"
" shl.b32 tD2, tD1, 7;\n\t"
" shr.b32 tA1, tA1, 25;\n\t"
" shr.b32 tB1, tB1, 25;\n\t"
" shr.b32 tC1, tC1, 25;\n\t"
" shr.b32 tD1, tD1, 25;\n\t"
" or.b32 tA1, tA1, tA2;\n\t"
" or.b32 tB1, tB1, tB2;\n\t"
" or.b32 tC1, tC1, tC2;\n\t"
" or.b32 tD1, tD1, tD2;\n\t"
" xor.b32 %0, %0, tA1;\n\t"
" xor.b32 %1, %1, tB1;\n\t"
" xor.b32 %2, %2, tC1;\n\t"
" xor.b32 %3, %3, tD1;\n\t"
"}"
: "+r"(A0), "+r"(B0), "+r"(C0), "+r"(D0) : "r" (A1), "r" (A2), "r" (B1), "r" (B2), "r" (C1), "r" (C2), "r" (D1), "r" (D2));
}
as well as added instruction level parallelism by formulating the CUDA code like this:
#define ROTL7(A0, A1, A2, B0, B1, B2, C0, C1, C2, D0, D1, D2) \
{\
volatile uint32_t tA1 = A1 + A2, tB1 = B1 + B2, tC1 = C1 + C2, tD1 = D1 + D2;\
volatile uint32_t tA2 = tA1<< 7, tB2 = tB1<< 7, tC2 = tC1<< 7, tD2 = tD1<< 7;\
tA1 = tA1>>25; tB1 = tB1>>25; tC1 = tC1>>25; tD1 = tD1>>25;\
tA2|= tA1 ; tB2|= tB1 ; tC2|= tC1 ; tD2|= tD1 ;\
A0 ^= tA2 ; B0 ^= tB2 ; C0 ^= tC2 ; D0 ^= tD2 ;\
}
but actually I couldn't get performance above what is already achieved. So in case you're wondering why there haven't been any updates. That is because my experiments in getting more speed haven't been fruitful yet.
Both Titans - 980Mhz Core, 6286Mhz Effective memory, 106% power target.
Both together ~500-530kH/s. Seems a Titan/K20 ought to do more in theory. Would love to see EACH at 500kH!
I really appreciate that this NVIDIA development may help increase LTC interest and adoption!