Evil-Knievel
Legendary
Offline
Activity: 1260
Merit: 1168
|
|
October 16, 2016, 06:30:48 PM |
|
Sorry, you're perfectly right! My mistake ... it's been a 16-hour working day already.
|
|
|
|
coralreefer
|
|
October 16, 2016, 06:32:29 PM |
|
We also have a slight problem with the order in which the expressions are performed. In Java they are processed in a different order. This little program mangles in this order: 6, 12, 12, 1, 1. When the code is compiled into C, this order applies: 12, 6, 12, 1, 1. The C function, by the way, is the following, where m() denotes the mangle function:

mem[m(6)] = m(m((4) * (3)));
return m((m(((1) == (1))?1:0))!=0?1:0);
Actually, I would expect that in C, m(6) would be the first operation to be executed ... NO! Unfortunately not! So we have to rethink the mangle function in the C compiler after all.

I agree that m[6] should have run first. The interpreter starts with the lowest left node, then the right one, then moves up, so I'm not clear why the order is wrong.

      =
     / \
  m[6]  *
       / \
      3   4

So I would expect the order to be: m[6], 3, 4, *, =
|
|
|
|
coralreefer
|
|
October 16, 2016, 06:36:40 PM |
|
Maybe a simple fix would be to run mangle_state after each statement instead of after each operator. This would be easy to incorporate into the C parser... how hard would it be to do on the Java side?
|
|
|
|
Evil-Knievel
Legendary
Offline
Activity: 1260
Merit: 1168
|
|
October 16, 2016, 06:37:22 PM |
|
I fixed the order of mangle in the Java interpreter. Everything is fine now!

m[6]=4*(5>>>(m[3]*2+2));
verify 1==1;
beavis@methusalem ~/Development/elastic-pl (git)-[master] % java -jar ElasticToCCompiler.jar fuck.spl; gcc -O0 fuck.spl.c; ./a.out
Elastic-to-C Compiler Version 0.1: Reading from file fuck.spl . . .
Elastic-to-C Compiler Version 0.1: OK.
MANGLE 6
MANGLE 0
MANGLE 2
MANGLE 1073741825
MANGLE 4
MANGLE 4
MANGLE 1
MANGLE 1
MANGLE STATE: 6724 -4611686018158952447.
beavis@methusalem ~/Development/elastic-pl (git)-[master] % java -jar ElasticPL.jar fuck.spl
Elastic Programming Language Interpreter Version 0.1: Reading from file fuck.spl . . .
[!] Stack usage exceeded: false
[!] AST depth: 8
[!] Worst case execution time: 11
MANGLE: 6
MANGLE: 0
MANGLE: 2
MANGLE: 1073741825
MANGLE: 4
MANGLE: 4
m[6]: 4
[!] Exit Stack Pointer: 0
MANGLE: 1
MANGLE: 1
[!] Mangle State: 6724 -4611686018158952447
[!] Bounty requirement met: true
beavis@methusalem ~/Development/elastic-pl (git)-[master] %
For testing purposes, here is the full C code as generated by the current ElasticToCCompiler.jar:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>
#include <time.h>

int32_t mem[64000];
uint64_t vm_state1 = 0;
uint64_t vm_state2 = 0;

uint64_t rotl64 (uint64_t x, unsigned int n) {
    const unsigned int mask = (CHAR_BIT*sizeof(x)-1);
    n &= mask;  // avoid undefined behaviour; zero overhead for most types/compilers
    return (x<<n) | (x>>( (-n)&mask ));
}

uint64_t rotr64 (uint64_t x, unsigned int n) {
    const unsigned int mask = (CHAR_BIT*sizeof(x)-1);
    n &= mask;  // avoid undefined behaviour; zero overhead for most types/compilers
    return (x>>n) | (x<<( (-n)&mask ));
}

uint32_t rotl32 (uint32_t x, unsigned int n) {
    const unsigned int mask = (CHAR_BIT*sizeof(x)-1);
    n &= mask;  // avoid undefined behaviour; zero overhead for most types/compilers
    return (x<<n) | (x>>( (-n)&mask ));
}

uint32_t rotr32 (uint32_t x, unsigned int n) {
    const unsigned int mask = (CHAR_BIT*sizeof(x)-1);
    n &= mask;  // avoid undefined behaviour; zero overhead for most types/compilers
    return (x>>n) | (x<<( (-n)&mask ));
}

int m(int x) {
    printf("MANGLE %d\n",x);
    int mod = x % 64;
    if (x % 2 == 0) {
        vm_state1 = rotl64(vm_state1, mod);
        vm_state1 = vm_state1 ^ x;
    } else {
        vm_state2 = rotr64(vm_state2, mod);
        vm_state2 = vm_state2 ^ x;
    }
    return x;
}

int execute();

int main(){
    execute();
    printf("MANGLE STATE: %lld %lld.\n",vm_state1,vm_state2);
}

int execute(){
    vm_state1=0;
    vm_state2=0;
    mem[m(6)] = m(m((4) * (m(rotr32((5),(m((m((mem[3]) * (2))) + (2)))%32)))));
    return m((m(((1) == (1))?1:0))!=0?1:0);
}
|
|
|
|
coralreefer
|
|
October 16, 2016, 11:04:22 PM |
|
EK, I created a DLL for the code above and called it instead of the internal interpreter. With PoW calcs running I got 160 kEval/s, and when I turned PoW off I got 400 kEval/s. I'm sure you're looking for quite a bit more, but at least it's a step in the right direction.
|
|
|
|
Evil-Knievel
Legendary
Offline
Activity: 1260
Merit: 1168
|
|
October 16, 2016, 11:09:45 PM |
|
EK, I created a DLL for the code above and called it instead of the internal interpreter. With PoW calcs running I got 160 kEval/s, and when I turned PoW off I got 400 kEval/s. I'm sure you're looking for quite a bit more, but at least it's a step in the right direction.
First of all ... awesome work! The speed, however, is strange ;-) I am on a ThinkPad T460p (I admit it has the i7 quad-core and 32GB RAM, but it's still just a notebook) and I am executing that code at around 6 million evals per second. Do you have something I could quickly compile on my Linux machine? Maybe the VC compiler is a bit inefficient, or the DLL calling is too costly ... maybe a static library is better here? I would quickly check.

EDIT: I could get the cachegrind profiler up and running quickly to tell you where the bottleneck lies.
|
|
|
|
Evil-Knievel
Legendary
Offline
Activity: 1260
Merit: 1168
|
|
October 16, 2016, 11:18:55 PM |
|
By the way, coralreefer: you are the author of the part of Elastic that makes the whole thing efficient in the first place! Thank you very much for all your hard work so far! I am sure Lannister will be grateful as well, won't he? Amazing stuff!
|
|
|
|
coralreefer
|
|
October 16, 2016, 11:49:23 PM |
|
Thanks EK. I don't have anything to send right now, as I'm not familiar with working with libraries outside of VC, and what I did for this test within VC is a complete hack.

However, I commented out everything in scanhash except the running of the code in the DLL, and I saw 4.5 MEval/s. So I don't think it's the linked library. It looks like the randomizing of the inputs and the PoW calcs still carry a pretty big penalty. But that can get cleaned up over time... it seems the approach is still viable.
|
|
|
|
Evil-Knievel
Legendary
Offline
Activity: 1260
Merit: 1168
|
|
October 16, 2016, 11:56:06 PM |
|
Thanks EK. I don't have anything to send right now, as I'm not familiar with working with libraries outside of VC, and what I did for this test within VC is a complete hack.

However, I commented out everything in scanhash except the running of the code in the DLL, and I saw 4.5 MEval/s. So I don't think it's the linked library. It looks like the randomizing of the inputs and the PoW calcs still carry a pretty big penalty. But that can get cleaned up over time... it seems the approach is still viable.
We could change the randomize-input routine to take the multiplicator as the last argument. This would be trivial, but it would allow for the calculation of a "midstate" which is retained across all iterations. Every iteration would then just have to finish the "remainder" of the MD5 hash calculation, beginning at the midstate, to account for the unique multiplier. Not sure how much we save with this; we need to check a profiler ... sometimes the bottleneck is where you would least expect it. At least, when I run a SHA256-based Bitcoin miner on my computer, I get roughly 10M hashes per second, and this example here is not significantly "more complicated". So there is some room yet to be exploited ;-)
|
|
|
|
coralreefer
|
|
October 17, 2016, 12:02:47 AM |
|
We could change the randomize-input routine to take the multiplicator as the last argument. This would be trivial but allow for the calculation of a "midstate" which is retained over all iterations. Every iteration would just have to finish the "remainder" of the MD5 hash calculation, beginning at the midstate, to account for the unique multiplier.
I think that should help... also because we then need to get random ints far less frequently, and I'm sure the algo I have there is slow. But the built-in random function is worthless.
|
|
|
|
Evil-Knievel
Legendary
Offline
Activity: 1260
Merit: 1168
|
|
October 17, 2016, 07:51:48 AM Last edit: October 17, 2016, 09:24:32 AM by Evil-Knievel |
|
We could change the randomize-input routine to take the multiplicator as the last argument. This would be trivial but allow for the calculation of a "midstate" which is retained over all iterations. Every iteration would just have to finish the "remainder" of the MD5 hash calculation, beginning at the midstate, to account for the unique multiplier.
I think that should help... also because we then need to get random ints far less frequently, and I'm sure the algo I have there is slow. But the built-in random function is worthless.

All of the following can be found in this branch: https://github.com/OrdinaryDude/xel_miner/tree/fast-optimization. xel_miner.c has some TODO FIXMEs where I changed the logic to not submit any PoW or bounties (as they are found too fast). Note: the meanings of rc=1 and rc=2 have been swapped.

Testing can be done with:
./xel_miner -k 19fafc1fa028af61d4bb603e1f9f06eca0bea765114115ad503b532588fbc83d --test-miner work.json --threads 1
(Note: it does not yet do anything useful ... it's just a hack with a statically compiled work package; no on-the-fly compiling yet.) I have managed to boost the speed a lot with the linked-library approach:

[09:49:30] DEBUG: Running ElasticPL Parser
[09:49:35] CPU0: 784.88 kEval/s

We are at almost 800,000 evals per second, all on one thread, WITH pseudorandom ints and the PoW SHA256 check ... 2 or more threads slow everything down!

Optimizations:

Most importantly, as the profiler showed me, gen_rand_32 is the no. 1 bottleneck! The miner now just increments the "multiplicator" instead of filling the last two ints with random input on every iteration; the random fill happens only once, now!
Like this:

static int scanhash(int thr_id, struct work *work, long *hashes_done) {
    ...
    mult32[6] = genrand_int32();
    mult32[7] = genrand_int32();
    while (1) {
        // Check if new work is available
        if (work_restart[thr_id].restart) {
            applog(LOG_DEBUG, "CPU%d: New work detected", thr_id);
            return 0;
        }
        // Increment mult32
        mult32[7] = mult32[7] + 1;
        if (mult32[7] == INT32_MAX) {
            mult32[7] = 0;
            mult32[6] = mult32[6] + 1;
        }
        ...

...then...

C flags: -Ofast -msse -msse2 -msse3 -mmmx -m3dnow -fext-numeric-literals

The mangle state now works on 32-bit integers only, avoiding the costly 64-bit rotation:

uint32_t vm_state1 = 0;
uint32_t vm_state2 = 0;
uint32_t vm_state3 = 0;
uint32_t vm_state4 = 0;

static const unsigned int mask32 = (CHAR_BIT*sizeof(uint32_t)-1);

static inline uint32_t rotl32 (uint32_t x, unsigned int n) {
    n &= mask32;  // avoid undefined behaviour; zero overhead for most types/compilers
    return (x<<n) | (x>>( (-n)&mask32 ));
}

static inline uint32_t rotr32 (uint32_t x, unsigned int n) {
    n &= mask32;  // avoid undefined behaviour; zero overhead for most types/compilers
    return (x>>n) | (x<<( (-n)&mask32 ));
}

static int m(int x) {
    int mod = x % 64;
    int leaf = mod % 4;
    if (leaf == 0) {
        vm_state1 = rotl32(vm_state1, mod);
        vm_state1 = vm_state1 ^ x;
    } else if (leaf == 1) {
        vm_state2 = rotl32(vm_state2, mod);
        vm_state2 = vm_state2 ^ x;
    } else if (leaf == 2) {
        vm_state3 = rotl32(vm_state3, mod);
        vm_state3 = vm_state3 ^ x;
    } else {
        vm_state4 = rotr32(vm_state4, mod);
        vm_state4 = vm_state4 ^ x;
    }
    return x;
}

fill_ints uses memset to clear the memory:

int fill_ints(int input[]){
    memset((char*)mem, 0, 64000*sizeof(char));
    for(int i=0;i<12;++i)
        mem[i] = input[i];
    vm_state1=0;
    vm_state2=0;
}

The complete "linked" program looks like this. fill_ints is called from xel_miner.c, and the vm_states are accessed directly to verify PoW, as is mem[] in the case of a bounty.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>
#include <time.h>
#include "miner.h"
int32_t mem[64000];
uint32_t vm_state1 = 0;
uint32_t vm_state2 = 0;
uint32_t vm_state3 = 0;
uint32_t vm_state4 = 0;

static const unsigned int mask32 = (CHAR_BIT*sizeof(uint32_t)-1);

static inline uint32_t rotl32 (uint32_t x, unsigned int n) {
    n &= mask32;  // avoid undefined behaviour; zero overhead for most types/compilers
    return (x<<n) | (x>>( (-n)&mask32 ));
}

static inline uint32_t rotr32 (uint32_t x, unsigned int n) {
    n &= mask32;  // avoid undefined behaviour; zero overhead for most types/compilers
    return (x>>n) | (x<<( (-n)&mask32 ));
}

static int m(int x) {
    int mod = x % 64;
    int leaf = mod % 4;
    if (leaf == 0) {
        vm_state1 = rotl32(vm_state1, mod);
        vm_state1 = vm_state1 ^ x;
    } else if (leaf == 1) {
        vm_state2 = rotl32(vm_state2, mod);
        vm_state2 = vm_state2 ^ x;
    } else if (leaf == 2) {
        vm_state3 = rotl32(vm_state3, mod);
        vm_state3 = vm_state3 ^ x;
    } else {
        vm_state4 = rotr32(vm_state4, mod);
        vm_state4 = vm_state4 ^ x;
    }
    return x;
}
int fill_ints(int input[]){
    memset((char*)mem, 0, 64000*sizeof(char));
    for(int i=0;i<12;++i)
        mem[i] = input[i];
    vm_state1=0;
    vm_state2=0;
    vm_state3=0;
    vm_state4=0;
}
int execute(){
    mem[m(6)] = m(m((7) ^ (m((4) * (m(rotr32((5),(m((m((mem[3]) * (2))) + (2)))%32)))))));
    mem[m(2)] = m(m((m(((mem[2]) == (mem[m((1) + (1))]))?1:0)) << (6)));
    mem[m(1)] = m(m(((mem[2]) != 0)?((mem[1]) * (mem[2])):0));
    mem[m(2)] = m(m((mem[1]) - (mem[0])));
    mem[m(3)] = m(m(rotl32((m(((mem[0]) != 0)?((mem[1]) % (mem[0])):0)),(m((m(((1) > (0))?1:0)) + (3)))%32)));
    return m((m(((mem[0]) == (m((0) - (mem[2]))))?1:0))!=0?1:0);
}
Now we are left with the memset inefficiency! The rest is looking fine.

EDIT! I am even faster now, with 1 million executions per second, by dropping memset() and doing a single aligned allocation! I see no point in nulling the memory at all! (mem is now a pointer rather than a static array, so it can be assigned.)

#ifdef _WIN32
#define ALLOC_ALIGNED_BUFFER(_numBytes) ((int *)_aligned_malloc (_numBytes, 64))
#define FREE_ALIGNED_BUFFER(_buffer) _aligned_free(_buffer)
#elif __SSE__
// allocate memory aligned to a 64-byte boundary
#define ALLOC_ALIGNED_BUFFER(_numBytes) (int *) _mm_malloc(_numBytes, 64)
#define FREE_ALIGNED_BUFFER(_buffer) _mm_free(_buffer)
#else
// NOTE(mhroth): valloc seems to work well, but is deprecated!
#define ALLOC_ALIGNED_BUFFER(_numBytes) (int *) valloc(_numBytes)
#define FREE_ALIGNED_BUFFER(_buffer) free(_buffer)
#endif

int fill_ints(int input[]){
    if(mem==0)
        mem=ALLOC_ALIGNED_BUFFER(64000*sizeof(int));
    for(int i=0;i<12;++i)
        mem[i] = input[i];
    vm_state1=0;
    vm_state2=0;
}
|
|
|
|
Evil-Knievel
Legendary
Offline
Activity: 1260
Merit: 1168
|
|
October 17, 2016, 07:56:08 AM |
|
I think we will get a very fast and efficient miner here ;-) The "60k" from the Java version starts looking more and more like a child's toy.
|
|
|
|
Evil-Knievel
Legendary
Offline
Activity: 1260
Merit: 1168
|
|
October 17, 2016, 10:12:37 AM |
|
Coralreefer: Update! Right now only suitable for Linux: the C miner generates C code on the fly, compiles it, and dynamically loads it into its memory to execute. It uses a dirty hack with the ElasticToCCompiler.jar because your AST compiler is not yet ready, but in the long term it should be swapped out! Also, linking in LLVM to avoid the gcc call via system() should be done! Works flawlessly, but far from cross-platform!

#include "miner.h"
#include <dlfcn.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

void *hndl = 0;

bool compile_and_link(char* source_code){
    FILE* tempfile = fopen("work_lib.spl", "w+");
    fprintf(tempfile, "%s", source_code);
    fclose(tempfile);

    FILE* tempfile2 = fopen("work_lib.c", "w+");
    char partialcode[600000];
    FILE *f = popen("java -jar ElasticToCCompiler.jar work_lib.spl", "r");
    if(f==0){
        printf("Cannot open spl file");
        exit(1);
    }
    while (fgets(partialcode, 600000, f) != NULL) {
        fprintf(tempfile2, "%s", partialcode);
    }
    fclose(tempfile2);
    pclose(f);

    system("sh ./build_shared.sh");

    hndl = dlopen("./work_lib.so", RTLD_NOW);
    if (!hndl) {
        fprintf(stderr, "%s\n", dlerror());
        exit(EXIT_FAILURE);
    }
    fill_ints = dlsym(hndl, "fill_ints");
    execute = dlsym(hndl, "execute");
    vm_state1 = dlsym(hndl, "vm_state1");
    vm_state2 = dlsym(hndl, "vm_state2");
    vm_state3 = dlsym(hndl, "vm_state3");
    vm_state4 = dlsym(hndl, "vm_state4");
    return true;
}

void free_compiler(){
    if(hndl!=0){
        dlclose(hndl);  // a dlopen handle must not be free()'d, only dlclose()'d
        hndl = 0;
    }
}

https://github.com/OrdinaryDude/xel_miner/tree/fast-optimization

use:
./xel_miner -k 19fafc1fa028af61d4bb603e1f9f06eca0bea765114115ad503b532588fbc83d --test-miner work.json --threads 1
|
|
|
|
Evil-Knievel
Legendary
Offline
Activity: 1260
Merit: 1168
|
|
October 17, 2016, 10:17:40 AM |
|
Now we're talking (desktop rig):

[12:17:03] DEBUG: Running ElasticPL Parser
[12:17:09] CPU0: 1901.46 kEval/s
[12:17:15] CPU0: 1919.21 kEval/s
|
|
|
|
Limx Dev
Copper Member
Legendary
Offline
Activity: 2324
Merit: 1348
|
|
October 17, 2016, 10:50:37 AM |
|
|
Bitcore BTX - a UTXO fork of Bitcoin - since 2017
|
|
|
coralreefer
|
|
October 17, 2016, 11:14:50 AM |
|
Now we're talking (desktop rig):

[12:17:03] DEBUG: Running ElasticPL Parser
[12:17:09] CPU0: 1901.46 kEval/s
[12:17:15] CPU0: 1919.21 kEval/s
Great job EK! I've got a busy week this week, but as I get time I'll incorporate your changes and see if I can replicate the logic for Windows. These new speeds will crush the retargeting mechanism once and for all... maybe that's next on the todo list.
|
|
|
|
klintay
Legendary
Offline
Activity: 1775
Merit: 1032
Value will be measured in sats
|
|
October 17, 2016, 02:05:40 PM |
|
Now we're talking (desktop rig):

[12:17:03] DEBUG: Running ElasticPL Parser
[12:17:09] CPU0: 1901.46 kEval/s
[12:17:15] CPU0: 1919.21 kEval/s
Nice job bro!
|
|
|
|
Evil-Knievel
Legendary
Offline
Activity: 1260
Merit: 1168
|
|
October 17, 2016, 03:05:46 PM |
|
It becomes even better: the multithreaded version is now working solid. On a notebook I get

[17:03:18] Attempting to start 4 miner threads
[17:03:18] 4 mining threads started
[17:03:24] CPU2: 840.81 kEval/s
[17:03:24] CPU0: 840.78 kEval/s
[17:03:24] CPU1: 838.86 kEval/s
[17:03:24] CPU3: 845.26 kEval/s
[17:02:52] CPU0: ***** POW Accepted! *****

which is 3,363,000+ evals per second! Bounties and PoW are all correctly found and submitted ;-) On my desktop rig I get 7M+. The miner is highly experimental, and needs GCC for the dlmopen().
|
|
|
|
Limx Dev
Copper Member
Legendary
Offline
Activity: 2324
Merit: 1348
|
|
October 17, 2016, 03:13:25 PM |
|
It becomes even better: the multithreaded version is now working solid. On a notebook I get

[17:03:18] Attempting to start 4 miner threads
[17:03:18] 4 mining threads started
[17:03:24] CPU2: 840.81 kEval/s
[17:03:24] CPU0: 840.78 kEval/s
[17:03:24] CPU1: 838.86 kEval/s
[17:03:24] CPU3: 845.26 kEval/s
[17:02:52] CPU0: ***** POW Accepted! *****

which is 3,363,000+ evals per second! Bounties and PoW are all correctly found and submitted ;-) On my desktop rig I get 7M+. The miner is highly experimental, and needs GCC for the dlmopen().

Really nice!
|
Bitcore BTX - a UTXO fork of Bitcoin - since 2017
|
|
|
Evil-Knievel
Legendary
Offline
Activity: 1260
Merit: 1168
|
|
October 17, 2016, 03:18:21 PM |
|
Really nice !
Thanks ;-) We went from 120k to 3 million overnight ;-) Whooo!
|
|
|
|
|