Bitcoin Forum
2781  Alternate cryptocurrencies / Mining (Altcoins) / Re: [ANN] New Improved altcoin CPU miner with support for AES-NI on: January 30, 2016, 09:30:54 PM
Try -march=native.
Probably the illegal instruction is executed on successful share submission, hence why the other pool doesn't hit it.
I did that right after I saw the problem above. The build would not complete with -march=native; I got errors halfway through the compile. The output below is as far back as it would let me go, since I had run many other things since then.

New CPUs are here, so I'll be out the rest of the evening, or at least until I get this thing back up and running.
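A note on the error in the log below: GCC reports "inlining failed in call to always_inline ... target specific option mismatch" when an intrinsic such as `_mm_shuffle_epi8` (which needs SSSE3) is called from code compiled without that instruction set enabled. The log alone does not show why -march=native left SSSE3 off here. As a rough sketch only, and not the miner's actual code, the usual pattern is to enable SSSE3 for just the function that needs it and to check the CPU at run time. The names `reverse16`, `reverse16_ssse3` and `reverse16_scalar` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>   /* declares _mm_shuffle_epi8 and the SSE2 load/store intrinsics */

/* Hypothetical helper: reverse 16 bytes with PSHUFB (SSSE3).
 * The target attribute enables SSSE3 for this one function, so GCC can
 * inline _mm_shuffle_epi8 even if the file is built without -mssse3. */
__attribute__((target("ssse3")))
static void reverse16_ssse3(const uint8_t *in, uint8_t *out)
{
    /* _mm_set_epi8 takes the highest byte first, so byte i of idx is 15 - i. */
    const __m128i idx = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                     8, 9, 10, 11, 12, 13, 14, 15);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(v, idx));
}

/* Portable fallback for CPUs without SSSE3. Buffers must not overlap. */
static void reverse16_scalar(const uint8_t *in, uint8_t *out)
{
    for (int i = 0; i < 16; i++)
        out[i] = in[15 - i];
}

/* Pick the implementation at run time, so the SSSE3 instruction is never
 * executed on a CPU that lacks it (which would raise "illegal instruction"). */
void reverse16(const uint8_t *in, uint8_t *out)
{
    assert(in != NULL && out != NULL);
    if (__builtin_cpu_supports("ssse3"))
        reverse16_ssse3(in, out);
    else
        reverse16_scalar(in, out);
}
```

The simpler alternative is to pass -mssse3 (or a -march value that includes it) for the whole file, but then the binary will only run on CPUs that have SSSE3.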

Code:
algo/echo/aes_ni/vperm.h:107:6: error: called from here
  x = _mm_xor_si128(x, _mm_shuffle_epi8(s1, M128(_k_aesmix4)));\
      ^
algo/echo/aes_ni/hash.c:130:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:335:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 2, 1, _state, 3, 2, 3, 0, 1);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:53:5: error: called from here
  x  = _mm_shuffle_epi8(*((__m128i*)table + 0), x);\
     ^
algo/echo/aes_ni/hash.c:134:5: note: in expansion of macro ‘TRANSFORM’
     TRANSFORM(s2, mul2ipt, t1, t2);\
     ^
algo/echo/aes_ni/hash.c:335:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 2, 1, _state, 3, 2, 3, 0, 1);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:52:5: error: called from here
  t1 = _mm_shuffle_epi8(*((__m128i*)table + 1), t1);\
     ^
algo/echo/aes_ni/hash.c:134:5: note: in expansion of macro ‘TRANSFORM’
     TRANSFORM(s2, mul2ipt, t1, t2);\
     ^
algo/echo/aes_ni/hash.c:335:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 2, 1, _state, 3, 2, 3, 0, 1);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:68:5: error: called from here
  t2 = _mm_shuffle_epi8(*((__m128i*)_k_inv + 1), x);\
     ^
algo/echo/aes_ni/vperm.h:100:2: note: in expansion of macro ‘SUBSTITUTE_VPERM_CORE’
  SUBSTITUTE_VPERM_CORE(x, t1, t2, t3, t4);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:70:5: error: called from here
  t3 = _mm_shuffle_epi8(*((__m128i*)_k_inv + 0), t1);\
     ^
algo/echo/aes_ni/vperm.h:100:2: note: in expansion of macro ‘SUBSTITUTE_VPERM_CORE’
  SUBSTITUTE_VPERM_CORE(x, t1, t2, t3, t4);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:72:5: error: called from here
  t4 = _mm_shuffle_epi8(*((__m128i*)_k_inv + 0), x);\
     ^
algo/echo/aes_ni/vperm.h:100:2: note: in expansion of macro ‘SUBSTITUTE_VPERM_CORE’
  SUBSTITUTE_VPERM_CORE(x, t1, t2, t3, t4);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:74:5: error: called from here
  t2 = _mm_shuffle_epi8(*((__m128i*)_k_inv + 0), t3);\
     ^
algo/echo/aes_ni/vperm.h:100:2: note: in expansion of macro ‘SUBSTITUTE_VPERM_CORE’
  SUBSTITUTE_VPERM_CORE(x, t1, t2, t3, t4);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:76:5: error: called from here
  t3 = _mm_shuffle_epi8(*((__m128i*)_k_inv + 0), t4);\
     ^
algo/echo/aes_ni/vperm.h:100:2: note: in expansion of macro ‘SUBSTITUTE_VPERM_CORE’
  SUBSTITUTE_VPERM_CORE(x, t1, t2, t3, t4);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:84:4: error: called from here
  y = _mm_shuffle_epi8(*((__m128i*)table + 1), x2);\
    ^
algo/echo/aes_ni/vperm.h:101:2: note: in expansion of macro ‘VPERM_LOOKUP’
  VPERM_LOOKUP(t2, t3, _k_sb1, s1, t1);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:83:4: error: called from here
  t = _mm_shuffle_epi8(*((__m128i*)table + 0), x1);\
    ^
algo/echo/aes_ni/vperm.h:101:2: note: in expansion of macro ‘VPERM_LOOKUP’
  VPERM_LOOKUP(t2, t3, _k_sb1, s1, t1);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:84:4: error: called from here
  y = _mm_shuffle_epi8(*((__m128i*)table + 1), x2);\
    ^
algo/echo/aes_ni/vperm.h:102:2: note: in expansion of macro ‘VPERM_LOOKUP’
  VPERM_LOOKUP(t2, t3, _k_sb2, s2, t1);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:83:4: error: called from here
  t = _mm_shuffle_epi8(*((__m128i*)table + 0), x1);\
    ^
algo/echo/aes_ni/vperm.h:102:2: note: in expansion of macro ‘VPERM_LOOKUP’
  VPERM_LOOKUP(t2, t3, _k_sb2, s2, t1);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:105:6: error: called from here
  x = _mm_xor_si128(x, _mm_shuffle_epi8(s3, M128(_k_aesmix2)));\
      ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:104:6: error: called from here
  x = _mm_shuffle_epi8(s2, M128(_k_aesmix1));\
      ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
algo/echo/aes_ni/vperm.h:107:6: error: called from here
  x = _mm_xor_si128(x, _mm_shuffle_epi8(s1, M128(_k_aesmix4)));\
      ^
algo/echo/aes_ni/hash.c:130:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:335:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 2, 1, _state, 3, 2, 3, 0, 1);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:53:5: error: called from here
  x  = _mm_shuffle_epi8(*((__m128i*)table + 0), x);\
     ^
algo/echo/aes_ni/hash.c:134:5: note: in expansion of macro ‘TRANSFORM’
     TRANSFORM(s2, mul2ipt, t1, t2);\
     ^
algo/echo/aes_ni/hash.c:335:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 2, 1, _state, 3, 2, 3, 0, 1);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:52:5: error: called from here
  t1 = _mm_shuffle_epi8(*((__m128i*)table + 1), t1);\
     ^
algo/echo/aes_ni/hash.c:134:5: note: in expansion of macro ‘TRANSFORM’
     TRANSFORM(s2, mul2ipt, t1, t2);\
     ^
algo/echo/aes_ni/hash.c:335:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 2, 1, _state, 3, 2, 3, 0, 1);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:68:5: error: called from here
  t2 = _mm_shuffle_epi8(*((__m128i*)_k_inv + 1), x);\
     ^
algo/echo/aes_ni/vperm.h:100:2: note: in expansion of macro ‘SUBSTITUTE_VPERM_CORE’
  SUBSTITUTE_VPERM_CORE(x, t1, t2, t3, t4);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:70:5: error: called from here
  t3 = _mm_shuffle_epi8(*((__m128i*)_k_inv + 0), t1);\
     ^
algo/echo/aes_ni/vperm.h:100:2: note: in expansion of macro ‘SUBSTITUTE_VPERM_CORE’
  SUBSTITUTE_VPERM_CORE(x, t1, t2, t3, t4);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:72:5: error: called from here
  t4 = _mm_shuffle_epi8(*((__m128i*)_k_inv + 0), x);\
     ^
algo/echo/aes_ni/vperm.h:100:2: note: in expansion of macro ‘SUBSTITUTE_VPERM_CORE’
  SUBSTITUTE_VPERM_CORE(x, t1, t2, t3, t4);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:74:5: error: called from here
  t2 = _mm_shuffle_epi8(*((__m128i*)_k_inv + 0), t3);\
     ^
algo/echo/aes_ni/vperm.h:100:2: note: in expansion of macro ‘SUBSTITUTE_VPERM_CORE’
  SUBSTITUTE_VPERM_CORE(x, t1, t2, t3, t4);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:76:5: error: called from here
  t3 = _mm_shuffle_epi8(*((__m128i*)_k_inv + 0), t4);\
     ^
algo/echo/aes_ni/vperm.h:100:2: note: in expansion of macro ‘SUBSTITUTE_VPERM_CORE’
  SUBSTITUTE_VPERM_CORE(x, t1, t2, t3, t4);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:84:4: error: called from here
  y = _mm_shuffle_epi8(*((__m128i*)table + 1), x2);\
    ^
algo/echo/aes_ni/vperm.h:101:2: note: in expansion of macro ‘VPERM_LOOKUP’
  VPERM_LOOKUP(t2, t3, _k_sb1, s1, t1);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:83:4: error: called from here
  t = _mm_shuffle_epi8(*((__m128i*)table + 0), x1);\
    ^
algo/echo/aes_ni/vperm.h:101:2: note: in expansion of macro ‘VPERM_LOOKUP’
  VPERM_LOOKUP(t2, t3, _k_sb1, s1, t1);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:84:4: error: called from here
  y = _mm_shuffle_epi8(*((__m128i*)table + 1), x2);\
    ^
algo/echo/aes_ni/vperm.h:102:2: note: in expansion of macro ‘VPERM_LOOKUP’
  VPERM_LOOKUP(t2, t3, _k_sb2, s2, t1);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:83:4: error: called from here
  t = _mm_shuffle_epi8(*((__m128i*)table + 0), x1);\
    ^
algo/echo/aes_ni/vperm.h:102:2: note: in expansion of macro ‘VPERM_LOOKUP’
  VPERM_LOOKUP(t2, t3, _k_sb2, s2, t1);\
  ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:105:6: error: called from here
  x = _mm_xor_si128(x, _mm_shuffle_epi8(s3, M128(_k_aesmix2)));\
      ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:336:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:104:6: error: called from here
  x = _mm_shuffle_epi8(s2, M128(_k_aesmix1));\
      ^
algo/echo/aes_ni/hash.c:126:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
algo/echo/aes_ni/vperm.h:107:6: error: called from here
  x = _mm_xor_si128(x, _mm_shuffle_epi8(s1, M128(_k_aesmix4)));\
      ^
algo/echo/aes_ni/hash.c:130:5: note: in expansion of macro ‘AES_ROUND_VPERM_CORE’
     AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
     ^
algo/echo/aes_ni/hash.c:335:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 2, 1, _state, 3, 2, 3, 0, 1);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:53:5: error: called from here
  x  = _mm_shuffle_epi8(*((__m128i*)table + 0), x);\
     ^
algo/echo/aes_ni/hash.c:134:5: note: in expansion of macro ‘TRANSFORM’
     TRANSFORM(s2, mul2ipt, t1, t2);\
     ^
algo/echo/aes_ni/hash.c:335:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 2, 1, _state, 3, 2, 3, 0, 1);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:52:5: error: called from here
  t1 = _mm_shuffle_epi8(*((__m128i*)table + 1), t1);\
     ^
algo/echo/aes_ni/hash.c:134:5: note: in expansion of macro ‘TRANSFORM’
     TRANSFORM(s2, mul2ipt, t1, t2);\
     ^
algo/echo/aes_ni/hash.c:335:4: note: in expansion of macro ‘ECHO_SUB_AND_MIX’
    ECHO_SUB_AND_MIX(_state2, 2, 1, _state, 3, 2, 3, 0, 1);
    ^
In file included from algo/echo/aes_ni/vperm.h:20:0,
                 from algo/echo/aes_ni/hash.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h:136:1: error: inlining failed in call to always_inline ‘_mm_shuffle_epi8’: target specific option mismatch
 _mm_shuffle_epi8 (__m128i __X, __m128i __Y)
 ^
In file included from algo/echo/aes_ni/hash.c:21:0:
algo/echo/aes_ni/vperm.h:68:5: error: called from here
  t2 = _mm_shuffle_epi8(*((__m128i*)_k_inv + 1), x);\
     ^

Tell me you forgot NO_AES_NI, otherwise the problems just got a lot bigger.



Edit: I must have had a messed-up environment. It compiled after a make distclean. Now I can look at your problem.
2782  Alternate cryptocurrencies / Mining (Altcoins) / Re: [ANN] New Improved altcoin CPU miner with support for AES-NI on: January 30, 2016, 08:23:00 PM
Configure command line?
./configure CFLAGS="-O3 -march=btver1 -DNO_AES_NI" --with-crypto --with-curl

Code:
./configure CFLAGS="-O3 -march=btver1 -DNO_AES_NI" --with-crypto --with-curl
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... no
checking for mawk... mawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether to enable maintainer-specific portions of Makefiles... no
checking for style of include used by make... GNU
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking dependency style of gcc... gcc3
checking for gcc option to accept ISO C99... none needed
checking how to run the C preprocessor... gcc -E
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking whether gcc needs -traditional... no
checking whether gcc and cc understand -c and -o together... yes
checking dependency style of gcc... gcc3
checking for ranlib... ranlib
checking for g++... g++
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking dependency style of g++... gcc3
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking sys/endian.h usability... no
checking sys/endian.h presence... no
checking for sys/endian.h... no
checking sys/param.h usability... yes
checking sys/param.h presence... yes
checking for sys/param.h... yes
checking syslog.h usability... yes
checking syslog.h presence... yes
checking for syslog.h... yes
checking for sys/sysctl.h... yes
checking whether be32dec is declared... no
checking whether le32dec is declared... no
checking whether be32enc is declared... no
checking whether le32enc is declared... no
checking whether le16dec is declared... no
checking whether le16enc is declared... no
checking for size_t... yes
checking for working alloca.h... yes
checking for alloca... yes
checking for getopt_long... yes
checking whether we can compile AVX code... yes
checking whether we can compile XOP code... yes
checking whether we can compile AVX2 code... yes
checking for json_loads in -ljansson... no
checking for pthread_create in -lpthread... yes
checking whether __uint128_t is supported... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: creating compat/Makefile
config.status: creating compat/jansson/Makefile
config.status: creating cpuminer-config.h
config.status: cpuminer-config.h is unchanged
config.status: executing depfiles commands

I was thinking the same thing.

The NO_AES_NI option should have prevented any incompatible code from being seen by the
compiler and the -march option should have told the compiler to build a compatible load.
The capability checks confirm both were attempted.

It's odd that you get a core dump on one pool but just rejects on the other.

I'll run a quick test on my CPU with -march=core2 and -DNO_AES_NI. Maybe you could try
core2 also. I haven't had much luck with specific architectures; native, core2 and corei7-avx
are all that have worked for me. I don't know if core2 will work on an AMD.

Edit: when compiling 3.0.7 with core2 and NO_AES_NI I'm getting errors in cryptonight AES_NI code
that is clearly supposed to be blocked. Have to dig deeper. I'm sure I tested this before release.
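
For anyone following along, the kind of guard being discussed can be sketched like this. It is not the cpuminer-opt source: reverse16 and HAVE_SHUFFLE_PATH are made-up names, and the only things taken from this thread are the NO_AES_NI define and the _mm_shuffle_epi8 intrinsic from the error log. The sketch assumes a GCC-compatible compiler, which defines __SSSE3__ when the chosen -march enables SSSE3, so the intrinsic is only seen by the compiler when the target allows it and NO_AES_NI is absent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guard: take the SIMD path only when the compiler targets
 * SSSE3 and the user has not opted out with -DNO_AES_NI. */
#if defined(__SSSE3__) && !defined(NO_AES_NI)
#include <tmmintrin.h>
#define HAVE_SHUFFLE_PATH 1
#else
#define HAVE_SHUFFLE_PATH 0
#endif

/* Reverse 16 bytes. out and in must not overlap. */
void reverse16(uint8_t out[16], const uint8_t in[16])
{
    assert(out != NULL && in != NULL);
#if HAVE_SHUFFLE_PATH
    /* Byte i of the result is taken from byte (15 - i) of the input. */
    const __m128i mask = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                      8, 9, 10, 11, 12, 13, 14, 15);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(v, mask));
#else
    /* Portable fallback: no intrinsic header is ever included. */
    for (size_t i = 0; i < 16; i++)
        out[i] = in[15 - i];
#endif
}
```

Building this with -DNO_AES_NI, or with a -march that lacks SSSE3, should take the portable branch, so the compiler never sees the intrinsic at all.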
2783  Alternate cryptocurrencies / Mining (Altcoins) / ATTN all miner devs, new mining engine coming on: January 30, 2016, 05:40:22 PM
Greetings

I have recently forked cpuminer and have undertaken a redesign of the mining engine,
better known as miner_thread.

It should be compatible with all miners that use the Pooler/Garzik front end.

Follow progress in the cpuminer-opt thread.

https://bitcointalk.org/index.php?topic=1326803.msg13720909#msg13720909

This redesign may not be very useful to single-algo miners; its biggest benefit
is the ability to add new algos quickly and easily.
2784  Alternate cryptocurrencies / Mining (Altcoins) / ATTN all miner devs, new mining engine implementation coming on: January 30, 2016, 05:37:07 PM
Greetings

I have recently forked cpuminer and have undertaken a redesign of the mining engine,
better known as miner_thread.

It should be compatible with all miners that use the Pooler/Garzik front end.

Follow progress in the cpuminer-opt thread.

https://bitcointalk.org/index.php?topic=1326803.msg13720909#msg13720909
2785  Alternate cryptocurrencies / Mining (Altcoins) / Re: [ANN] New Improved altcoin CPU miner with support for AES-NI on: January 30, 2016, 04:17:20 PM
How to use algo gate, for devs.

algo gate is the new miner_thread engine for running an algo in the correct environment.

It makes adding new algos easy. Here's how:

1. Define the usual functions in the algo's source file. The scanhash and hash functions are mandatory.

2. Define a registration function

Code:
// Registers the x11 functions with the gate.
bool register_x11_algo( algo_gate_t* gate )
{
  gate->init_ctx       = &init_x11_aes_ctx;
  // scanhash and hash are mandatory
  gate->scanhash       = &scanhash_x11_aes;
  gate->hash           = &x11_hash_aes;
  // optional entries, left disabled here
//gate->get_custom_opt = &x11_get_custom_opt;
//gate->get_max64      = &x11_get_max64;
  return true;
}

3. Add #include "algo-gate-api.h" to the algo's source file.

4. Add a new entry to initialize_algo_gate in algo-gate-api.c to register your algo.

Code:
     case ALGO_X11:
        register_x11_algo( &algo_gate );
        break;

5. Compile and run. When cpuminer starts it will use the algo registry
    without needing to know anything about the algo except its name.

That's all there is to it. Any miner using the Garzik/Pooler front end can implement this.
The only pain is retrofitting all the existing algos.
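
To see the steps above end to end, here is a small self-contained sketch. It only borrows the names algo_gate_t, init_ctx, scanhash, hash and initialize_algo_gate from this post. The function signatures, the demo algo, and passing the gate as a parameter instead of using a global are all assumptions made for illustration, not the real cpuminer-opt definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified gate: the real struct and signatures may differ. */
typedef struct {
    void     (*init_ctx)(void);
    uint32_t (*hash)(uint32_t nonce);
    bool     (*scanhash)(uint32_t first, uint32_t last,
                         uint32_t target, uint32_t *found);
} algo_gate_t;

enum algo { ALGO_DEMO, ALGO_COUNT };

/* --- would live in the algo's own source file (hypothetical algo) --- */
static void demo_init_ctx(void) { /* nothing to set up in this toy algo */ }

static uint32_t demo_hash(uint32_t nonce) { return nonce * 2654435761u; }

/* Scan nonces in [first, last) for a hash at or below target. */
static bool demo_scanhash(uint32_t first, uint32_t last,
                          uint32_t target, uint32_t *found)
{
    assert(found != NULL);
    for (uint32_t n = first; n < last; n++) {
        if (demo_hash(n) <= target) {
            *found = n;
            return true;
        }
    }
    return false;
}

bool register_demo_algo(algo_gate_t *gate)
{
    gate->init_ctx = &demo_init_ctx;
    gate->hash     = &demo_hash;
    gate->scanhash = &demo_scanhash;
    return true;
}

/* --- would live in the gate's source file: one case per algo --- */
bool initialize_algo_gate(algo_gate_t *gate, enum algo a)
{
    assert(gate != NULL);
    switch (a) {
    case ALGO_DEMO: return register_demo_algo(gate);
    default:        return false;   /* unknown algo: nothing registered */
    }
}
```

After initialize_algo_gate succeeds, the base code calls gate.init_ctx(), gate.scanhash(...) and gate.hash(...) without ever naming the algo's functions directly.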


More to come


2786  Alternate cryptocurrencies / Mining (Altcoins) / Re: [ANN] New Improved altcoin CPU miner with support for AES-NI on: January 30, 2016, 03:06:32 PM
I wasn't asking for help with it, I was just bringing it to your attention in case you didn't know about it.
In my limited testing, it and cpuminer-multi both run GPU algos on Intel & AMD CPUs in Windows, no VM or 2nd OS needed.

I did compile 3.0.7 on 2 Intel rigs & the AMD rig. Still no luck with the AMD, only seeing a 20% load max no matter how many cores are used.
Intel on the other hand runs about the same as the 3.0.2 that was in use; from a quick look, 3.0.7 was just a tad lower on hashrate than 3.0.2.
But those were just the startup numbers. Long term that could change, so I will have to check it later once it's been working a while.

I'm confused. Your first paragraph says you aren't asking for help and the AMD rig is working.
But the second one says you are having problems with 3.0.7 and the AMD rig isn't working.

Is the problem with cpuminer-opt or the rig?

Thanks for the info about cpuminer-multi. It seems to claim SSE2 optimizations, so how does it
compare with cpuminer-opt? Maybe they did a better job.

You're probably right about 3.0.2 performance. I'm reviewing the performance of all releases because
I think some of my tweaks actually hurt performance.

Thanks for the extensive testing.

AMD is working with cpuminer-multi or cpuminer_x64_SSE2 in Windows; in the VM your versions don't work for the AMD rig.
Yours loads now and appears to be working, but at a low load rate ("20% or less"), and I never see any shares completed in the amount of time I've let it run.
Over 2 hours, nothing.

OK, so there is a problem with cpuminer-opt in a Linux VM on your AMD rig, but the other miners work OK
natively in Windows on the same machine? Is this correct?

How does the performance compare between cpuminer-multi and cpuminer_x64_SSE2? It has SSE2 in
the name but does it have the performance? It should be significantly faster than multi.

Edit: another thing, could you try the other miners in the VM? This eliminates another variable. Thanks.
When did it break for cpuminer-opt?
#1 Yes.

#2 Haven't had much of a chance to take a close look at those numbers yet.
RL is being a pain atm.

#3 When I get the time I'll be more than glad to. I did try to compile cpuminer-multi in the VM, with no luck so far.
No source for cpuminer_x64_SSE2 to compile that I'm aware of.

#4 It has never worked for the AMD rig, but by this time next week with any luck I'll have my new CPUs up and running so it will no longer matter.
I will try to get you as much info as I can before the changeover if you'd like, but it looks like I may be the only one running this type of setup.
So that may not help anyone, but I'm willing to see what I can dig up.

No pressure, you can test whatever you please or not at all. I'm very appreciative of everything you've done so far.

It appears the VM is the issue. I don't know why it would connect to stratum and get new work but never
start hashing. At least I know where to start looking.

I can test cpuminer_x64_SSE2 myself, it was just a curiosity.

Many thanks.
2787  Alternate cryptocurrencies / Mining (Altcoins) / Re: [ANN] New Improved altcoin CPU miner with support for AES-NI on: January 30, 2016, 02:28:40 PM
Nice work! I like clean and modular code!

Thanks, I haven't yet found a way to run multiple entry functions upon program load.
The system I'm familiar with could declare any file with a main function to have it run
automatically when the program is loaded. Essentially it functions similarly to an OS init
system but for an application. If I can do that in C/C++ I could plug the only remaining
crack in the wall between algo code and base code. If I can't do it from within the
algo file I might have to write an init system.
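
One possible way to get this in C, sketched under assumptions: GCC and Clang support __attribute__((constructor)), which runs a function before main. That is a compiler extension, not standard C. The registry, register_algo, find_algo and the demo algo below are hypothetical names for illustration, not anything from cpuminer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ALGOS 16

typedef struct {
    const char *name;
    int (*hash)(int);
} algo_entry_t;

/* Zero-initialized at load time, so it is safe to use from constructors. */
static algo_entry_t registry[MAX_ALGOS];
static size_t registry_count = 0;

void register_algo(const char *name, int (*hash)(int))
{
    assert(name != NULL && hash != NULL);
    assert(registry_count < MAX_ALGOS);
    registry[registry_count].name = name;
    registry[registry_count].hash = hash;
    registry_count++;
}

/* Look an algo up by name; returns NULL if it never registered. */
const algo_entry_t *find_algo(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < registry_count; i++)
        if (strcmp(registry[i].name, name) == 0)
            return &registry[i];
    return NULL;
}

/* --- would live in the algo's own source file --- */
static int demo_hash(int x) { return x + 1; }

/* Runs automatically before main (GCC/Clang extension). */
__attribute__((constructor))
static void register_demo(void)
{
    register_algo("demo", demo_hash);
}
```

With this pattern the base code only calls find_algo by name. One caveat: if the algo objects are packed into a static library, the linker may drop object files that nothing references, and their constructors never run.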
2788  Alternate cryptocurrencies / Mining (Altcoins) / Re: [ANN] New Improved altcoin CPU miner with support for AES-NI on: January 30, 2016, 06:45:49 AM
The following may be of interest to devs of cpuminer, ccminer and maybe others.

Update:

More architectural changes made. Most declarations moved from cpu-miner.c
to miner.h so no files will need to include it. The system I am familiar with could prevent
a file from being embedded, and that was the default. The owner had to explicitly allow inclusion,
and by whom. Nice feature.

I still have to migrate full_test and get_opt to use the gate. I will have to split up the options
into common options and algo-specific options. cpu-miner.c will handle the common options
and the algos will handle the algo-specific options that apply to them.

I will also look into enhancing the registration process to have the algo and its custom
options added to the argument list and help file. These are the only remaining holes in the gate,
to my knowledge.
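
A rough sketch of how that option split might look, purely as an illustration. The option names, the structs and the handler signature are invented here. The only idea taken from the post is that the base code handles common options and hands anything else to the algo's own handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct { int threads; } common_opts_t;     /* hypothetical */
typedef struct { int rounds;  } demo_algo_opts_t;  /* hypothetical */

/* Signature assumed for illustration: returns true if the option was consumed. */
typedef bool (*custom_opt_fn)(const char *key, const char *value, void *algo_opts);

/* Algo side: knows only its own options. */
bool demo_get_custom_opt(const char *key, const char *value, void *algo_opts)
{
    demo_algo_opts_t *o = algo_opts;
    assert(key != NULL && value != NULL && o != NULL);
    if (strcmp(key, "rounds") == 0) {
        o->rounds = atoi(value);
        return true;
    }
    return false;
}

/* Base side: handles common options, delegates the rest through the gate. */
bool handle_opt(const char *key, const char *value, common_opts_t *common,
                custom_opt_fn custom, void *algo_opts)
{
    assert(key != NULL && value != NULL && common != NULL);
    if (strcmp(key, "threads") == 0) {
        common->threads = atoi(value);
        return true;
    }
    /* Not a common option: let the algo decide, if it has a handler. */
    return custom != NULL && custom(key, value, algo_opts);
}
```

An option that neither side recognizes comes back as false, which is where the base code would print its usual "unknown option" error.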

The gate is almost complete. It's called a gate because it is the only path between cpu-miner.c
and the algos. A very popular technique where I come from.

A developer's preview should be available in a couple of days.

I am hopeful the rearchitecting will help solve some of the other nagging issues I have been having,
like flipping between undefined and multi defined.

Previous post:
--------
I've done a lot of rebuilding under the hood because miner_thread had become
polluted with too much algo-specific code. Well, that is all gone.
Following is a summary of the changes. I will make a preview available before release.

- restructured the file system to be more modular with respect to algos

- removed bloat added to support multiple CPU targets.

- removed all algo-specific code from miner.h and cpu-miner.c, and eventually util.c
   and anywhere else I find it; it all now lives in the algo's source file.

- adding algos doesn't require any changes to base code; only a registry interface file needs one new entry.

- redesigned the scanhash engine to use function pointers instead of switch/case.

- removed many long switch/case statements.

- created a registry where algos register their functions with cpu-miner.

- registry functions include scanhash, hash, init_ctx, get_max64, and get_custom_opts with plans
  to add full_test and any others I find.

- enabled initialization of algo context outside of miner_thread loop.

- changes do not increase performance directly but enable more efficient management of contexts,
  which helps certain algos, specifically quark.
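
As an illustration of the context point in the list above, here is a toy sketch. It assumes that "management of contexts" means initializing a context once and copying it for each hash. The context type, the toy hash and the counter are all hypothetical and not taken from cpuminer-opt.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t state[4]; } demo_ctx_t;   /* hypothetical context */

unsigned demo_init_calls = 0;   /* counts how often the init code runs */

void demo_init_ctx(demo_ctx_t *ctx)
{
    assert(ctx != NULL);
    for (int i = 0; i < 4; i++)
        ctx->state[i] = 0x9e3779b9u * (uint32_t)(i + 1);
    demo_init_calls++;
}

/* Hash one nonce, starting from a copy of the pre-initialized context. */
uint32_t demo_hash(const demo_ctx_t *base, uint32_t nonce)
{
    demo_ctx_t ctx;
    assert(base != NULL);
    memcpy(&ctx, base, sizeof ctx);   /* cheap copy instead of a fresh init */
    ctx.state[0] ^= nonce;
    return ctx.state[0] + ctx.state[1] + ctx.state[2] + ctx.state[3];
}

/* Scan count nonces; the context is initialized once, outside the loop. */
uint32_t scan(uint32_t first, uint32_t count)
{
    demo_ctx_t base;
    uint32_t acc = 0;
    demo_init_ctx(&base);
    for (uint32_t i = 0; i < count; i++)
        acc ^= demo_hash(&base, first + i);
    return acc;
}
```

However many nonces scan covers, demo_init_ctx runs exactly once per call, which is the saving this arrangement is meant to show.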
2789  Alternate cryptocurrencies / Mining (Altcoins) / Re: [ANN] New Improved altcoin CPU miner with support for AES-NI on: January 30, 2016, 01:19:36 AM
I wasn't asking for help with it, I was just bringing it to your attention in case you didn't know about it.
In my limited testing, it and cpuminer-multi both run GPU algos on Intel & AMD CPUs in Windows, no VM or 2nd OS needed.

I did compile 3.0.7 on 2 Intel rigs & the AMD rig. Still no luck with the AMD, only seeing a 20% load max no matter how many cores are used.
Intel on the other hand runs about the same as the 3.0.2 that was in use; from a quick look, 3.0.7 was just a tad lower on hashrate than 3.0.2.
But those were just the startup numbers. Long term that could change, so I will have to check it later once it's been working a while.

I'm confused. Your first paragraph says you aren't asking for help and the AMD rig is working.
But the second one says you are having problems with 3.0.7 and the AMD rig isn't working.

Is the problem with cpuminer-opt or the rig?

Thanks for the info about cpuminer-multi. It seems to claim SSE2 optimizations, so how does it
compare with cpuminer-opt? Maybe they did a better job.

You're probably right about 3.0.2 performance. I'm reviewing the performance of all releases because
I think some of my tweaks actually hurt performance.

Thanks for the extensive testing.

AMD is working with cpuminer-multi or cpuminer_x64_SSE2 in Windows; in the VM your versions don't work for the AMD rig.
Yours loads now and appears to be working, but at a low load rate ("20% or less"), and I never see any shares completed in the amount of time I've let it run.
Over 2 hours, nothing.

OK, so there is a problem with cpuminer-opt in a Linux VM on your AMD rig, but the other miners work OK
natively in Windows on the same machine? Is this correct?

How does the performance compare between cpuminer-multi and cpuminer_x64_SSE2? It has SSE2 in
the name but does it have the performance? It should be significantly faster than multi.

Edit: another thing, could you try the other miners in the VM? This eliminates another variable. Thanks.
When did it break for cpuminer-opt?
2790  Alternate cryptocurrencies / Mining (Altcoins) / Re: [ANN] New Improved altcoin CPU miner with support for AES-NI on: January 29, 2016, 09:22:41 PM
Wasn't asking for help with it, was just bringing it to your attention in case you didn't know about it.
It and cpuminer-multi both run GPU algos in my limited testing with each on Intel & AMD CPUs in Windows, no VM or 2nd OS needed.

I did compile 3.0.7 on 2 Intel rigs & the AMD rig, still no luck with the AMD, only seeing a 20% load max no matter the core amount used.
Intel on the other hand runs about the same as 3.0.2 that was in use; 3.0.7 from a quick look was just a tad lower on hashrate than 3.0.2.
But that was just the startup numbers, long term that could change, will have to check it later once it's been working a while.

I'm confused. Your first paragraph says you aren't asking for help and the AMD rig is working.
But the second one says you are having problems with 3.0.7 and the AMD rig isn't working.

Is the problem with cpuminer-opt or the rig?

Thanks for the info about cpuminer-multi. It seems to claim SSE2 optimizations so how does it
compare with cpuminer-opt? Maybe they did a better job.

You're probably right about 3.0.2 performance. I'm reviewing the performance of all releases because
I think some of my tweaks actually hurt performance.

Thanks for the extensive testing.
2791  Alternate cryptocurrencies / Mining (Altcoins) / Re: [ANN] New Improved altcoin CPU miner with support for AES-NI on: January 29, 2016, 07:00:06 PM

Not sure if there is an uncompiled version or Linux version of cpuminer_x64_SSE2, but the .exe of this
will mine x11 & a few other coins in Windows7-x64. No AES support, none of the cpuminer_x64_avx flavors
worked on any of my rigs, unsupported hardware? Using the cpuminer_x64_sse2 on the AMD rig ATM mining x11.
AMD rig has Server 2012 R2 as an OS. 42 day trial left on that. lol

Edit: On another note I have an old AMD FX dual CPU motherboard & hardware sitting collecting dust. If I can find the time
from RL I'll set up a test bed with that, "IF" I can find a spare PSU to use with it & a spot somewhere to place it. lol

I don't follow your first paragraph. Are you talking about another fork? What is cpuminer_x64_sse2?
It's found in the \NiceHashMiner_v1.2.2.2\bin folder after unpacking.

runline: cpuminer_x64_SSE2 -a x11 -o stratum+tcp://x11.usa.nicehash.com:3336 -u 18tvS3deKZK5q4eTtPRWYeEMWmGmuErHgz.H8QG6 -p d=0.01 -t 32



Edit: Apparently it seems to be cpuminer-multi 1.1 renamed? My guess is the NiceHash software conf line makes it use SSE2 when run with it,
but with no options for any of the GPU algos to be mined on the CPU, since I don't see SSE2 being used in my run. Or maybe SSE2 is used but just not printed as in use.

 

OK, that's not mine, you'll have to talk to nicehash about that.
2792  Alternate cryptocurrencies / Mining (Altcoins) / Re: [ANN] New Improved altcoin CPU miner with support for AES-NI on: January 29, 2016, 04:30:56 PM

Not sure if there is an uncompiled version or Linux version of cpuminer_x64_SSE2, but the .exe of this
will mine x11 & a few other coins in Windows7-x64. No AES support, none of the cpuminer_x64_avx flavors
worked on any of my rigs, unsupported hardware? Using the cpuminer_x64_sse2 on the AMD rig ATM mining x11.
AMD rig has Server 2012 R2 as an OS. 42 day trial left on that. lol

Edit: On another note I have an old AMD FX dual CPU motherboard & hardware sitting collecting dust. If I can find the time
from RL I'll set up a test bed with that, "IF" I can find a spare PSU to use with it & a spot somewhere to place it. lol

I don't follow your first paragraph. Are you talking about another fork? What is cpuminer_x64_sse2?
2793  Alternate cryptocurrencies / Mining (Altcoins) / Re: [ANN] New Improved altcoin CPU miner with support for AES-NI on: January 29, 2016, 10:12:54 AM
It didn't take long to hit a brick wall with windows.

static inline void transform(cubehashParm *sp )

Expected '(' to follow 'inline'

WTF?

 I guess that settles it.

visual studio or mingw?

VS. It worked for me for ccminer so I was hoping if I got the project file right it would work.
I was very careful with it matching the file list side by side with the makefile. There were
windows versions of all the miners I harvested from but I didn't compile any of them
myself.

I'll try mingw again, this time in a VM

Epsylon3 version of cpuminer compiles well with visual studio

I'll try compiling darkcoin minerd and cp3u. If they don't both compile then I know I have a bigger
problem.
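
For what it's worth, that error looks like the usual symptom of older Visual Studio compiling a .c file as C, where the plain inline keyword isn't recognized but __inline is. Here is a minimal sketch of a workaround, assuming that really is the cause (I haven't confirmed it on this code). MINER_INLINE and rotl32 are made-up names for illustration, not anything from cpuminer-opt:

```c
#include <stdint.h>

/* Hypothetical portability macro (name is an assumption):
 * older MSVC in C mode accepts __inline but not inline. */
#if defined(_MSC_VER) && !defined(__cplusplus)
#  define MINER_INLINE __inline
#else
#  define MINER_INLINE inline
#endif

/* Example helper declared the same way as transform() above. */
static MINER_INLINE uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;                                   /* keep the shift count in range */
    return n ? (x << n) | (x >> (32 - n)) : x; /* avoid an undefined 32-bit shift */
}
```

If this is the problem, the macro would have to be applied to every static inline in the tree, so switching to mingw may still be less work.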
2794  Alternate cryptocurrencies / Mining (Altcoins) / Re: [ANN] New Improved altcoin CPU miner with support for AES-NI on: January 29, 2016, 09:50:25 AM
It didn't take long to hit a brick wall with windows.

static inline void transform(cubehashParm *sp )

Expected '(' to follow 'inline'

WTF?

 I guess that settles it.

visual studio or mingw?

VS. It worked for me for ccminer so I was hoping if I got the project file right it would work.
I was very careful with it matching the file list side by side with the makefile. There were
windows versions of all the miners I harvested from but I didn't compile any of them
myself.

I'll try mingw again, this time in a VM
2795  Alternate cryptocurrencies / Mining (Altcoins) / Re: CCminer(SP-MOD) Modded NVIDIA Maxwell kernels. on: January 29, 2016, 08:33:30 AM
You're not getting the real issue here - GDS is read ONCE. ONLY ONCE. In pretty much all X algos. Well, per kernel.
I'm not getting what you're saying. It's not about repeated accesses to the same data, it's about access to
different data in the same cache line only once. Preloading the cache line with the initial load instruction
means the subsequent data will be available sooner.
Anyway SP doesn't seem interested and it's his thread so I should probably drop it.

X11 and quark only read memory linearly. In my mod I use vector instructions in the gpu to load many 32-bit words in one instruction.

If you compile this to ptx you will see what I mean.

#include "cuda_vector.h"
...
   uint32_t h[16];
   uint28 *phash = (uint28*)hash;
   uint28 *outpt = (uint28*)h;
   outpt[0] = phash[0];
   outpt[1] = phash[1];




That makes sense. I presume the size of the vector is the same as a cache line. That pretty much neutralizes
what I intended to accomplish. What I was proposing had two stages: fill the cache and load register from cache
with other instructions in between. If cuda does all that in one instruction I have to just wait. Got it now.

I'll look for some suitable code in cpuminer to try it on.
2796  Alternate cryptocurrencies / Mining (Altcoins) / Re: CCminer(SP-MOD) Modded NVIDIA Maxwell kernels. on: January 29, 2016, 07:38:02 AM

You're not getting the real issue here - GDS is read ONCE. ONLY ONCE. In pretty much all X algos. Well, per kernel.

I'm not getting what you're saying. It's not about repeated accesses to the same data, it's about access to
different data in the same cache line only once. Preloading the cache line with the initial load instruction
means the subsequent data will be available sooner.

Anyway SP doesn't seem interested and it's his thread so I should probably drop it.
2797  Alternate cryptocurrencies / Mining (Altcoins) / Re: CCminer(SP-MOD) Modded NVIDIA Maxwell kernels. on: January 29, 2016, 06:25:38 AM
Here's the short answer assuming a 64 bit mem bus and a 32 byte cache line, ie 1 address cycle and
4 data cycles per burst to fill the cache line, a 4 deep mem queue and 2 instructions per clock.
An optimized memcpy in pseudo asm.

ld r0, src                 ; start loading 1st src cache line
ld r4, src+4               ; start loading 2nd src cache line
preallocate dst cache      ; intent to write so cache fill from mem not required, no wait
st r0, dst                 ; be ready as soon as first word arrives, stall here
ld r1, src+1               ; load 2nd word of 1st line, cached now no wait
st r1, dst+1               ; store it immediately, no stall
ld r2, src+2               ; etc
st r2, dst+2
ld r3, src+3
st r3, dst+3
flush src                  ; flush the first source cache line unmodified, no writeback
flush dst                  ; modified, write back to mem now to keep bus busy
st r4, dst+4               ; by now the second cache line is filled, no wait
ld r0, src+8               ; start filling 3rd cache line
                           ; finish saving second cache line etc.

This does not maximize double instruction issue because all instructions use the same execution unit.
The bus is kept busy after an initial wait for the first word. While you wait, do anything else you can
that uses another execution unit, like incrementing counters. Those instructions are essentially free.
If the function was modified to process every word, that would also be free. In fact the more processing you
do the more efficient it gets, because you are using the alu more and all the while the mem bus is busy
doing its thing as fast as it can. If the mem IF is designed properly there should be no problem with collisions.
It should always prioritize reads before writes.

If this model can be implemented on cuda we should see  some gains. I just don't have the cuda knowledge
to know if it can be done or how.
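
On the CPU side the closest thing I can show in C is a copy loop that asks for the next source line before working through the current one. This is only a sketch under the same assumptions as above (32 byte line, 64 bit words). copy_words_prefetch and WORDS_PER_LINE are names I made up, and __builtin_prefetch is a GCC/Clang hint that the hardware is free to ignore, so any gain would have to be measured:

```c
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_LINE 4   /* assumption: 32-byte cache line of 64-bit words */

/* Copy n 64-bit words, requesting the next source line before copying
 * the current one so the memory bus has work queued while we store. */
static void copy_words_prefetch(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t i = 0;
    for (; i + WORDS_PER_LINE <= n; i += WORDS_PER_LINE) {
#if defined(__GNUC__) || defined(__clang__)
        /* Read prefetch, low temporal locality; a hint only, it does not fault. */
        __builtin_prefetch(src + i + WORDS_PER_LINE, 0, 0);
#endif
        for (size_t j = 0; j < WORDS_PER_LINE; j++)
            dst[i + j] = src[i + j];
    }
    for (; i < n; i++)     /* tail shorter than one line */
        dst[i] = src[i];
}
```

There is no portable C equivalent of the preallocate and flush steps, so those parts of the pseudo asm are left out here.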
2798  Alternate cryptocurrencies / Mining (Altcoins) / Re: CCminer(SP-MOD) Modded NVIDIA Maxwell kernels. on: January 29, 2016, 04:40:11 AM
I was hoping to get a better response to my technical trolls but all I got was more bluster.
I was trying to find out if our skills were complementary. I am a complete noob when it comes
to cuda so I was hoping SP could implement some of my ideas with his knowledge of cuda.
When I provided a demonstration of my skills he responded with "silly you, that was cpu verification
code, and why don't you do better", without ever considering the technical merit or other
applications for the changes I made.
He's more interested in selling what he has over and over again rather than providing anything new
that sells itself. I'm afraid SP has turned into a telemarketer.
Assembler for NVIDIA Maxwell architecture
https://github.com/NervanaSystems/maxas
Thanks, that will be useful when I learn how to use it. I'm looking for docs that describe the cuda
processor architecture in detail so I can determine things like how many loads to queue up to
fill the pipe, how many execution units, user cache management, etc. That kind of information
is necessary to maximize instruction throughput at the processor level. Do you know of any available
docs with this kind of info?

There is not much info available, but if you disassemble compiled code you will see that the Maxwell is superscalar with 2 pipes, 2 instructions per cycle. It's able to execute instructions while writing to memory if the code is in the instruction cache. And to avoid ALU stalls you need to reorder your instructions carefully. There are vector instructions that can write bigger chunks of memory with fewer instructions, etc. The compiler is usually doing a good job here. Little to gain. Ask DJM34 for more info. He is good in the random stuff...

Thanks again.

Have you tried interleaving memory accesses with arith instructions so they can be issued the same clock?
When copying mem do you issue the first load and the first store immediately after it? The first load fills the cache
line and the first store waits for the first bytes to become available. Then you can queue up enough loads to fill
the pipe and do other things while waiting for mem. Multi-buffering is a given, being careful not to overuse regs.

If you're doing a load, process, and store it's even better because you can have one instruction slot focused on memory
while the other can do the processing.

These are things I'd like to try but haven't got the time. Although I've done similar in the past there were no performance
tests that could quantify the effect, good or bad.

If you think this has merit give it a shot. Like I said if it works just keep it open because I could still implement it myself.
The hotter the code segments you choose the bigger the result should be. Some of the assembly routines would be logical
targets.

GDS (global memory) LDS (local memory), and work-item shuffle all require a little waiting period before they complete. So, say I'm using ds_swizzle_b32 (work-item shuffle) like I had fun with in my 4-way Echo-512... On AMD GCN, you can do some shit like so:

Code:
# These shuffle in the correct state dwords from other work-items that are adjacent.
# This is done in place of BigShiftRows, but before BigMixColumns.
# So, my uint4 variables (in OpenCL notation) named b and c are now loaded properly without the need for shifting rows.
ds_swizzle_b32 v36, v80 offset:0x8039 # b.z
ds_swizzle_b32 v37, v81 offset:0x8039 # b.w
ds_swizzle_b32 v38, v78 offset:0x8039 # b.x
ds_swizzle_b32 v39, v79 offset:0x8039 # b.y
ds_swizzle_b32 v15, v84 offset:0x804E # c.z
ds_swizzle_b32 v16, v85 offset:0x804E # c.w
ds_swizzle_b32 v33, v82 offset:0x804E # c.x
ds_swizzle_b32 v34, v83 offset:0x804E # c.y

# Each and every one of these takes time, however - and each one increments a little counter.
# What I can do is this - since the first row in the state is not shifted, the a variable is already ready
# It's in registers and ready to be used.

# The first thing I do in the OpenCL after loading up the proper state values - in BigMixColumns - is a ^ b.
# So, I can do something like this:

s_waitcnt lgkmcnt(4)

# What this does is, it waits on the pending operations until there are four left.
# They're queued in the order the instructions were issued - so the b uint4 should now be loaded
# Note, however, that the c uint4 is NOT guaranteed to have been loaded, and cannot be relied on (yet.)
# Now, I can process the XOR while the swizzle operation on the c uint4 is working!

v_xor_b32 v42, v15, v36    # v42 = a.z ^ b.z
v_xor_b32 v43, v16, v37    # v43 = a.w ^ b.w
v_xor_b32 v38, v74, v38    # v38 = a.x ^ b.x
v_xor_b32 v39, v75, v39    # v39 = a.y ^ b.y

# And then we can put in an instruction to wait for the c uint4 before we continue...
s_waitcnt lgkmcnt(0)

In case you're wondering, I load the d uint4 later in the code. Also, if you *really* wanna try your damndest to maximize the time spent executing compute shit during loads, you could do this (although you've probably figured it out by now):

Code:
ds_swizzle_b32 v36, v80 offset:0x8039 # b.z
ds_swizzle_b32 v37, v81 offset:0x8039 # b.w
ds_swizzle_b32 v38, v78 offset:0x8039 # b.x
ds_swizzle_b32 v39, v79 offset:0x8039 # b.y
ds_swizzle_b32 v15, v84 offset:0x804E # c.z
ds_swizzle_b32 v16, v85 offset:0x804E # c.w
ds_swizzle_b32 v33, v82 offset:0x804E # c.x
ds_swizzle_b32 v34, v83 offset:0x804E # c.y

s_waitcnt lgkmcnt(7)
v_xor_b32 v42, v15, v36    # v42 = a.z ^ b.z
s_waitcnt lgkmcnt(6)
v_xor_b32 v43, v16, v37    # v43 = a.w ^ b.w
s_waitcnt lgkmcnt(5)
v_xor_b32 v38, v74, v38    # v38 = a.x ^ b.x
s_waitcnt lgkmcnt(4)
v_xor_b32 v39, v75, v39    # v39 = a.y ^ b.y

# You get the idea...

I think I follow even though that syntax is completely foreign to me. I think what you did is what
I was talking about. But I would go one step farther. It may not apply because I don't understand
the wait instructions unless there are synchronization issues.

In addition to what you did I would put the first xor on b immediately after the first load. I know
it's stalled waiting for data but I want its dependent instruction already queued for when the data
becomes available.

Secondly that first load will fill the cache line so there is no need to queue up the next load instruction until
the first load completes. Subsequent loads will finish immediately because they hit the cache.

What I would not do is have a string of identical instructions because they all compete for the
same execution unit and can only be issued one per clock. I would interleave the swizzles and xors
so they can both be issued on the same clock, assuming all dependencies are met.


With comments:

ds_swizzle_b32 v36, v80 offset:0x8039    # b.z             // start filling the cache with b
v_xor_b32 v42, v15, v36                  # v42 = a.z ^ b.z // queue first xor for when b is ready
ds_swizzle_b32 v37, v81 offset:0x8039    # b.w             // this will complete one clock after the previous swizzle so...
v_xor_b32 v43, v16, v37                  # v43 = a.w ^ b.w // make sure we're ready for it

I think you get it. When all the B vars are loaded you can queue the C vars while still processing and
saving the first batch.

I would even go one step farther to the loading of a, if possible.

NO, NO, NO. The swizzle operation, like LDS and GDS loads, take TIME. Clock cycles. If you try and use the result without using s_waitcnt to be sure that the operation has completed, more than likely you'll be reading garbage. The likelihood of this occurring becomes greater and greater the closer your read instruction is to the load instruction that must be waited on. Or, more accurately, how few clock cycles have passed.

The uint4 I named a is already in registers - if you wanna walk it back, it's actually from an AES operation before, which may be done via LDS lookups into a table and XORs, or a bitsliced AES S-box followed by an otherwise mostly classic-style AES implementation.

I think your misunderstanding is that you think v_xor_b32 queues something. It doesn't. ds_* instructions you might be able to say "queue" something, in the sense that they trigger an LDS read/write and immediately allow for the next instruction to be executed. v_xor_b32 is an immediate XOR of two registers. It couldn't give a fuck less what's in them, or what you meant to put in them - it's going to XOR them and put the result into the destination register, and if it's not what you intended it to be, that's your problem.
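
To make the counter behaviour concrete, here is a toy model in C. It is not GCN code and none of these names exist anywhere real (lgkm_model, model_issue, model_waitcnt and model_ready are invented for the illustration). It assumes ops complete strictly in issue order, and it takes the worst case where nothing has completed until a wait forces it. Real hardware may finish earlier, but the program can't rely on that:

```c
#include <stdbool.h>

#define MAX_OPS 16

/* Toy model only: async ops complete strictly in issue order. */
typedef struct {
    bool done[MAX_OPS]; /* done[i]: result of op i has landed in its register */
    int issued;         /* ops issued so far */
    int retired;        /* ops completed so far, oldest first */
} lgkm_model;

/* Issue an async op (stands in for ds_swizzle_b32); returns its index or -1. */
static int model_issue(lgkm_model *m)
{
    if (m->issued >= MAX_OPS)
        return -1;                    /* toy model has fixed capacity */
    m->done[m->issued] = false;
    return m->issued++;
}

/* Stands in for s_waitcnt lgkmcnt(n): block until at most n ops are outstanding. */
static void model_waitcnt(lgkm_model *m, int n)
{
    while (m->issued - m->retired > n)
        m->done[m->retired++] = true; /* oldest outstanding op completes */
}

/* A consumer (stands in for v_xor_b32) may only trust a finished op's result. */
static bool model_ready(const lgkm_model *m, int op)
{
    return op >= 0 && op < m->issued && m->done[op];
}
```

Issue eight ops, call model_waitcnt with 4, and only the first four report ready. That is the b uint4 in the example above, with c still in flight.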

I would start with swizzle a immediately followed by swizzle b then the first xor.
There will be a lot of stalling waiting for memory here so if there are any other trivial tasks
do them next.

Loading a & b in parallel may seem odd but once both are in the cache you're flying. Then you can
mix saving processed data and loading new data, giving priority to loads to keep the GPU hot and
you can stick in the first swizzle c early to get the data ready.

I learned some of this stuff on a company-paid Motorola course. The instructor was a geek and our class
was pretty sharp so we covered the material early then started having fun. At the time we were in a
performance crunch with customers demanding more capacity so we focused on code scheduling and user
cache management. One of the more bizarre instructions was the delayed branch. It essentially means
branch AFTER the next instruction. That next instruction was often returning the rc. It took some getting
used to but it gives an idea of the level of optimization they were into at the time.

It's the same CPU that had the ability to mark a cache line valid without touching mem. It's great for
malloc because the data is initially undefined anyway. Who cares whether the garbage comes from
mem or stale cache, it's all garbage. Imagine mallocing 1k and having it cached without ever touching
the bus. They also have an instruction to preload the cache for real. That is essentially what I was
simulating above. It also had a user flush so you could flush data at any convenient time after you
no longer needed it instead of a system initiated flush when you are stalled waiting for new data.


Keep in mind - there is no swizzle for the uint4 named a - the first row is not shifted. This is why you don't see any swizzle ops for it - it is entirely contained within the single work-item. This is why I swizzle b, then c, and then begin my XORs. Keep in mind, again, that this triggers the start of the swizzle and immediately goes to the next instruction - this means if I do b AND c one after another, and only wait on b in order to XOR it with a, I'm putting more clock cycles between the time I initiated the swizzle for c, and the time I need it to complete. It's entirely possible that by the time I call on s_waitcnt to ensure the c variables are ready, they already are and the instruction takes basically no time at all.

You're also thinking about cache, which doesn't apply here at all - swizzle is a 4-way crossbar that allows transfer of values between work-items on a compute unit. In addition, even if it wasn't, I couldn't give a fuck less if it's in the cache - hell, I'd rather it NOT be. Why? Simply because I just loaded those values into registers, and were they in memory, I would never be reading them from memory again. X11 and friends can be done using ZERO global memory at all (besides getting work, storing the state for the next kernel, and output, of course) - if you work at it, it can even be done without using LDS for storage of shit like AES tables. Now, you *may* want to use LDS for other reasons to create an optimal GPU implementation, but these are related more to parallelizing the hash functions more, by unrolling them across multiple WIs (like this Echo-512 we're discussing), rather than actual storage of data that's honestly needed to compute the hash function. Because of this, cache is really irrelevant, and in an extremely well optimized X11 kernel set, you should be able to downclock memory to hell and have it not matter one iota.

Fun fact: This is why the claim of X11 being "ASIC-resistant" is more or less a flat out lie. What most people call "ASIC-resistant" is actually "ASIC-unprofitable" - meaning that the ASIC would cost so much that its advantage over the currently used mining hardware doesn't justify making it. Usually, this is done via memory usage, at least for now. But X11 isn't memory-hard - shit, it doesn't really need memory at all, especially if implemented in hardware.

Perhaps your example was not well chosen, too many new concepts for me. Try to think of it in the general sense where
data is loaded from mem, some processing done, and stored back in mem. I usually see a string of 4 or 8 loads followed
by a similar string of xors or adds or whatever and then a string of stores. This is ok in the sense it uses multi-buffering but
is inefficient because it can't take advantage of multiple instruction issue. It's all serial.

There's also no need to rush the second load because the first one will get the cache
filled (assuming there is cache). And user cache management doesn't depend on whether the application caches well.
It's useful because the coder can manage the cache to overcome the app's deficiency.

Need some data soon but have other things to do first?  preload it so it's ready when you are.

Done with a buffer? flush the cache line to get rid of the data and free up the bus for the next data you need.

It's all about managing the data and planning when you need it and how to have it when you need it so you don't
have to wait as long. With mem being the bottleneck you want to prioritize managing mem accesses to reduce
latency. You don't want the bus sitting idle while you do a shitload of alu stuff just to have to wait when you ask
for more data.

When I mentioned queuing the xor I didn't mean it literally. I just meant have it ready to be issued as soon as the
data arrives.
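
In plain C the general load, process, store idea looks something like the sketch below: the load for the next word is started before the current word is processed and stored, so the memory access and the alu work can overlap. xor_pipelined is a made-up name, and an optimizing compiler or an out-of-order CPU may well reorder things on its own, so treat this as an illustration of the scheduling and not a guaranteed speedup:

```c
#include <stddef.h>
#include <stdint.h>

/* XOR every word of src with key into dst. The load for iteration i+1
 * is issued before word i is processed and stored (software pipelining). */
static void xor_pipelined(uint32_t *dst, const uint32_t *src, size_t n, uint32_t key)
{
    if (n == 0)
        return;
    uint32_t cur = src[0];          /* prime the pipe with the first load */
    for (size_t i = 0; i + 1 < n; i++) {
        uint32_t next = src[i + 1]; /* start the next load early */
        dst[i] = cur ^ key;         /* process and store the current word */
        cur = next;
    }
    dst[n - 1] = cur ^ key;         /* drain the last word */
}
```

It only needs two live registers per stream, which matters when you are multi-buffering and trying not to overuse regs.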
2799  Alternate cryptocurrencies / Mining (Altcoins) / Re: [ANN] New Improved altcoin CPU miner with support for AES-NI on: January 29, 2016, 03:20:33 AM
It didn't take long to hit a brick wall with windows.

static inline void transform(cubehashParm *sp )

Expected '(' to follow 'inline'

WTF?

 I guess that settles it.
2800  Alternate cryptocurrencies / Mining (Altcoins) / Re: CCminer(SP-MOD) Modded NVIDIA Maxwell kernels. on: January 29, 2016, 03:13:10 AM
I was hoping to get a better response to my technical trolls but all I got was more bluster.
I was trying to find out if our skills were complementary. I am a complete noob when it comes
to cuda so I was hoping SP could implement some of my ideas with his knowledge of cuda.
When I provided a demonstration of my skills he responded with "silly you, that was cpu verification
code, and why don't you do better", without ever considering the technical merit or other
applications for the changes I made.
He's more interested in selling what he has over and over again rather than providing anything new
that sells itself. I'm afraid SP has turned into a telemarketer.
Assembler for NVIDIA Maxwell architecture
https://github.com/NervanaSystems/maxas
Thanks, that will be useful when I learn how to use it. I'm looking for docs that describe the cuda
processor architecture in detail so I can determine things like how many loads to queue up to
fill the pipe, how many execution units, user cache management, etc. That kind of information
is necessary to maximize instruction throughput at the processor level. Do you know of any available
docs with this kind of info?

There is not much info available, but if you disassemble compiled code you will see that the Maxwell is superscalar with 2 pipes, 2 instructions per cycle. It's able to execute instructions while writing to memory if the code is in the instruction cache. And to avoid ALU stalls you need to reorder your instructions carefully. There are vector instructions that can write bigger chunks of memory with fewer instructions, etc. The compiler is usually doing a good job here. Little to gain. Ask DJM34 for more info. He is good in the random stuff...

Thanks again.

Have you tried interleaving memory accesses with arith instructions so they can be issued the same clock?
When copying mem do you issue the first load and the first store immediately after it? The first load fills the cache
line and the first store waits for the first bytes to become available. Then you can queue up enough loads to fill
the pipe and do other things while waiting for mem. Multi-buffering is a given, being careful not to overuse regs.

If you're doing a load, process, and store it's even better because you can have one instruction slot focused on memory
while the other can do the processing.

These are things I'd like to try but haven't got the time. Although I've done similar in the past there were no performance
tests that could quantify the effect, good or bad.

If you think this has merit give it a shot. Like I said if it works just keep it open because I could still implement it myself.
The hotter the code segments you choose the bigger the result should be. Some of the assembly routines would be logical
targets.

GDS (global memory) LDS (local memory), and work-item shuffle all require a little waiting period before they complete. So, say I'm using ds_swizzle_b32 (work-item shuffle) like I had fun with in my 4-way Echo-512... On AMD GCN, you can do some shit like so:

Code:
# These shuffle in the correct state dwords from other work-items that are adjacent.
# This is done in place of BigShiftRows, but before BigMixColumns.
# So, my uint4 variables (in OpenCL notation) named b and c are now loaded properly without the need for shifting rows.
ds_swizzle_b32 v36, v80 offset:0x8039 # b.z
ds_swizzle_b32 v37, v81 offset:0x8039 # b.w
ds_swizzle_b32 v38, v78 offset:0x8039 # b.x
ds_swizzle_b32 v39, v79 offset:0x8039 # b.y
ds_swizzle_b32 v15, v84 offset:0x804E # c.z
ds_swizzle_b32 v16, v85 offset:0x804E # c.w
ds_swizzle_b32 v33, v82 offset:0x804E # c.x
ds_swizzle_b32 v34, v83 offset:0x804E # c.y

# Each and every one of these takes time, however - and each one increments a little counter.
# What I can do is this - since the first row in the state is not shifted, the a variable is already ready
# It's in registers and ready to be used.

# The first thing I do in the OpenCL after loading up the proper state values - in BigMixColumns - is a ^ b.
# So, I can do something like this:

s_waitcnt lgkmcnt(4)

# What this does is, it waits on the pending operations until there are four left.
# They're queued in the order the instructions were issued - so the b uint4 should now be loaded
# Note, however, that the c uint4 is NOT guaranteed to have been loaded, and cannot be relied on (yet.)
# Now, I can process the XOR while the swizzle operation on the c uint4 is working!

v_xor_b32 v42, v15, v36    # v42 = a.z ^ b.z
v_xor_b32 v43, v16, v37    # v43 = a.w ^ b.w
v_xor_b32 v38, v74, v38    # v38 = a.x ^ b.x
v_xor_b32 v39, v75, v39    # v39 = a.y ^ b.y

# And then we can put in an instruction to wait for the c uint4 before we continue...
s_waitcnt lgkmcnt(0)

In case you're wondering, I load the d uint4 later in the code. Also, if you *really* wanna try your damndest to maximize the time spent executing compute shit during loads, you could do this (although you've probably figured it out by now):

Code:
ds_swizzle_b32 v36, v80 offset:0x8039 # b.z
ds_swizzle_b32 v37, v81 offset:0x8039 # b.w
ds_swizzle_b32 v38, v78 offset:0x8039 # b.x
ds_swizzle_b32 v39, v79 offset:0x8039 # b.y
ds_swizzle_b32 v15, v84 offset:0x804E # c.z
ds_swizzle_b32 v16, v85 offset:0x804E # c.w
ds_swizzle_b32 v33, v82 offset:0x804E # c.x
ds_swizzle_b32 v34, v83 offset:0x804E # c.y

s_waitcnt lgkmcnt(7)
v_xor_b32 v42, v15, v36    # v42 = a.z ^ b.z
s_waitcnt lgkmcnt(6)
v_xor_b32 v43, v16, v37    # v43 = a.w ^ b.w
s_waitcnt lgkmcnt(5)
v_xor_b32 v38, v74, v38    # v38 = a.x ^ b.x
s_waitcnt lgkmcnt(4)
v_xor_b32 v39, v75, v39    # v39 = a.y ^ b.y

# You get the idea...

I think I follow even though that syntax is completely foreign to me. I think what you did is what
I was talking about. But I would go one step farther. It may not apply because I don't understand
the wait instructions unless there are synchronization issues.

In addition to what you did I would put the first xor on b immediately after the first load. I know
it's stalled waiting for data but I want its dependent instruction already queued for when the data
becomes available.

Secondly that first load will fill the cache line so there is no need to queue up the next load instruction until
the first load completes. Subsequent loads will finish immediately because they hit the cache.

What I would not do is have a string of identical instructions because they all compete for the
same execution unit and can only be issued one per clock. I would interleave the swizzles and xors
so they can both be issued on the same clock, assuming all dependencies are met.


With comments:

ds_swizzle_b32 v36, v80 offset:0x8039    # b.z             // start filling the cache with b
v_xor_b32 v42, v15, v36                  # v42 = a.z ^ b.z // queue first xor for when b is ready
ds_swizzle_b32 v37, v81 offset:0x8039    # b.w             // this will complete one clock after the previous swizzle so...
v_xor_b32 v43, v16, v37                  # v43 = a.w ^ b.w // make sure we're ready for it

I think you get it. When all the B vars are loaded you can queue the C vars while still processing and
saving the first batch.

I would even go one step farther to the loading of a, if possible.

I would start with swizzle a immediately followed by swizzle b then the first xor.
There will be a lot of stalling waiting for memory here so if there are any other trivial tasks
do them next.

Loading a & b in parallel may seem odd but once both are in the cache you're flying. Then you can
mix saving processed data and loading new data, giving priority to loads to keep the GPU hot and
you can stick in the first swizzle c early to get the data ready.

I learned some of this stuff on a company-paid Motorola course. The instructor was a geek and our class
was pretty sharp so we covered the material early then started having fun. At the time we were in a
performance crunch with customers demanding more capacity so we focused on code scheduling and user
cache management. One of the more bizarre instructions was the delayed branch. It essentially means
branch AFTER the next instruction. That next instruction was often returning the rc. It took some getting
used to but it gives an idea of the level of optimization they were into at the time.

It's the same CPU that had the ability to mark a cache line valid without touching mem. It's great for
malloc because the data is initially undefined anyway. Who cares whether the garbage comes from
mem or stale cache, it's all garbage. Imagine mallocing 1k and having it cached without ever touching
the bus. They also have an instruction to preload the cache for real. That is essentially what I was
simulating above. It also had a user flush so you could flush data at any convenient time after you
no longer needed it instead of a system initiated flush when you are stalled waiting for new data.