Today I worked on x13 (fugue). Original code:
x0 = ((c0 ^ r0) & SPH_C32(0xFF000000)) \
| ((c1 ^ r1) & SPH_C32(0x00FF0000)) \
| ((c2 ^ r2) & SPH_C32(0x0000FF00)) \
| ((c3 ^ r3) & SPH_C32(0x000000FF)); \
x1 = ((c1 ^ (r0 << 8)) & SPH_C32(0xFF000000)) \
| ((c2 ^ (r1 << 8)) & SPH_C32(0x00FF0000)) \
| ((c3 ^ (r2 << 8)) & SPH_C32(0x0000FF00)) \
| ((c0 ^ (r3 >> 24)) & SPH_C32(0x000000FF)); \
x2 = ((c2 ^ (r0 << 16)) & SPH_C32(0xFF000000)) \
| ((c3 ^ (r1 << 16)) & SPH_C32(0x00FF0000)) \
| ((c0 ^ (r2 >> 16)) & SPH_C32(0x0000FF00)) \
| ((c1 ^ (r3 >> 16)) & SPH_C32(0x000000FF)); \
x3 = ((c3 ^ (r0 << 24)) & SPH_C32(0xFF000000)) \
| ((c0 ^ (r1 >> 8)) & SPH_C32(0x00FF0000)) \
| ((c1 ^ (r2 >> 8)) & SPH_C32(0x0000FF00)) \
| ((c2 ^ (r3 >> 8)) & SPH_C32(0x000000FF)); \
Replaced with:
t0 = __byte_perm(c0, c1, 0x0145);\
t1 = __byte_perm(c0, c1, 0x2367);\
t2 = __byte_perm(c2, c3, 0x0145);\
t3 = __byte_perm(c2, c3, 0x2367);\
t4 = __byte_perm(t0, t3, 0x0347);\
t6 = __byte_perm(t1, t2, 0x4703);\
t7 = __byte_perm(c1, c2, 0x0505);\
t8 = __byte_perm(c0, c3, 0x6363);\
t9 = __byte_perm(t7, t8, 0x0145);\
t10 = __byte_perm(c0, c3, 0x4141);\
t11 = __byte_perm(c1, c2, 0x1717);\
t12 = __byte_perm(t10, t11, 0x0145);\
t13 = __byte_perm(r0, r1, 0x0505);\
t14 = __byte_perm(r2, r3, 0x3737);\
t15 = __byte_perm(t13, t14, 0x0145);\
t16 = __byte_perm(r0, r1, 0x1616);\
t17 = __byte_perm(r2, r3, 0x3434);\
t18 = __byte_perm(t16, t17, 0x0145);\
t19 = __byte_perm(r0, r1, 0x2727);\
t20 = __byte_perm(r2, r3, 0x0505);\
t21 = __byte_perm(t19, t20, 0x0145);\
t22 = __byte_perm(r0, r1, 0x3434);\
t23 = __byte_perm(r0, r2, 0x5151);\
t24 = __byte_perm(t22, t23, 0x0145);\
x0 = t4^t15;\
x1 = t9^t18;\
x2 = t6^t21;\
x3 = t12^t24;\
There is a bug somewhere in the perms, so the code doesn't hash correctly. The PTX shows half as many assembly instructions, but the speed is the same, perhaps a little bit faster. Seems to be a dead end. :/
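One way to hunt for the bug would be to check the selectors on the host before touching the kernel again. Below is a minimal sketch in plain C, assuming the default mode of __byte_perm as NVIDIA documents it: bytes 0 to 3 come from the first argument (byte 0 is the least significant), bytes 4 to 7 from the second, and the lowest selector nibble picks result byte 0. The names byte_perm_host, gather_diag, x0_candidate and count_x0_mismatches are hypothetical, and the selectors 0x3600, 0x0050 and 0x7610 are just one possible choice for x0, not the ones used in the kernel above.

```c
#include <stdint.h>
#include <stddef.h>

/* Host emulation of __byte_perm (default mode, assumption: only the low
   3 bits of each selector nibble are used). */
static uint32_t byte_perm_host(uint32_t x, uint32_t y, uint32_t s)
{
    uint64_t pool = ((uint64_t)y << 32) | x;   /* bytes 0..3 = x, 4..7 = y */
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t idx = (s >> (4 * i)) & 0x7;   /* source byte for result byte i */
        out |= (uint32_t)((pool >> (8 * idx)) & 0xFF) << (8 * i);
    }
    return out;
}

/* x0 exactly as in the original mask-and-or code. */
static uint32_t x0_reference(const uint32_t c[4], const uint32_t r[4])
{
    return ((c[0] ^ r[0]) & 0xFF000000u)
         | ((c[1] ^ r[1]) & 0x00FF0000u)
         | ((c[2] ^ r[2]) & 0x0000FF00u)
         | ((c[3] ^ r[3]) & 0x000000FFu);
}

/* Gathers w0.byte3, w1.byte2, w2.byte1, w3.byte0 into one word. */
static uint32_t gather_diag(const uint32_t w[4])
{
    uint32_t hi = byte_perm_host(w[0], w[1], 0x3600); /* byte3 = w0.b3, byte2 = w1.b2 */
    uint32_t lo = byte_perm_host(w[3], w[2], 0x0050); /* byte1 = w2.b1, byte0 = w3.b0 */
    return byte_perm_host(lo, hi, 0x7610);            /* low half of lo, high half of hi */
}

/* Candidate x0: permute c and r separately, then xor. */
static uint32_t x0_candidate(const uint32_t c[4], const uint32_t r[4])
{
    return gather_diag(c) ^ gather_diag(r);
}

/* Small LCG so the check needs no external random source. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Counts inputs on which the candidate disagrees with the reference. */
size_t count_x0_mismatches(size_t trials)
{
    uint32_t state = 1u;
    size_t bad = 0;
    for (size_t t = 0; t < trials; t++) {
        uint32_t c[4], r[4];
        for (int i = 0; i < 4; i++) {
            c[i] = lcg_next(&state);
            r[i] = lcg_next(&state);
        }
        if (x0_reference(c, r) != x0_candidate(c, r))
            bad++;
    }
    return bad;
}
```

Calling count_x0_mismatches(1000) and getting 0 means the candidate matches on those inputs. The same pattern should extend to x1, x2 and x3: write each as a reference function and a perm-based candidate, and the first word with a nonzero count points at the selectors to look at.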