Bitcoin Forum
September 28, 2016, 11:55:34 PM *
Author Topic: New demonstration CPU miner available  (Read 371856 times)
dserrano5
Legendary
*
Offline Offline

Activity: 1568



View Profile
June 14, 2011, 10:25:13 PM
 #441

I suspected it, and saw it confirmed when I didn't specify the number of threads and read the lines in the log file. Nothing new here, but thanks for pointing it out anyway ;).

-ck
Moderator
Legendary
*
Offline Offline

Activity: 1918


Ruu \o/


View Profile WWW
June 14, 2011, 10:40:37 PM
 #442

I just downloaded and built cpuminer-1.0.2. I expected to see some improvements thanks to ckolivas' affinity changes (assuming they have made it into the release), but I'm surprised to find I'm getting the same speed:

Code:
processor       : 7
physical id     : 1
core id         : 3

"siblings" and "cpu cores" both have the value 4 in all entries.

Indeed, the changes are unlikely to make any drastic throughput improvement. The advantage of the new build is that it automatically detects the number of processors and sets the number of threads accordingly, chooses the best algorithm by default, and then binds the threads to each CPU. CPU affinity does not have drastic effects on throughput unless you have a complicated cache arrangement in your hardware (such as NUMA or multiple physical CPUs) and the workload has a large cache footprint. SHA-256 calculations (which is all that mining is) do not have a large cache footprint. However you do realise that when it says processor 7, it means processors 0-7 which means you have 8? I didn't realise that jgarzik hadn't incorporated the new total throughput counter which is in my git tree. It will let you really get a handle on what your throughput is, rather than trying to examine each thread one at a time.

Oh and the CPU affinity is disabled when the number of threads is not a multiple of the number of CPUs (8, 16, 24 etc. in your case).
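
The binding rule described in this post can be sketched roughly as follows. This is only an illustration of the stated behaviour, not cpuminer's actual code: `plan_affinity` is a hypothetical helper name, and the round-robin thread-to-CPU mapping is an assumption.

```python
import os

def plan_affinity(n_threads, n_cpus):
    """Return a CPU index for each thread, or None when binding is skipped.

    Follows the rule stated in the post: threads are bound only when the
    thread count is a multiple of the CPU count (8, 16, 24, ... on 8 CPUs).
    The round-robin mapping itself is an assumption made for illustration.
    """
    if n_cpus <= 0 or n_threads <= 0 or n_threads % n_cpus != 0:
        return None
    return [t % n_cpus for t in range(n_threads)]

if __name__ == "__main__":
    cpus = os.cpu_count() or 1  # default thread count: one per detected CPU
    print(plan_affinity(cpus, cpus))
```

With 7 threads on an 8-CPU machine the sketch returns `None`, which corresponds to the case where affinity is disabled.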

Primary developer/maintainer for cgminer and ckpool/ckproxy.
Pooled mine at kano.is, solo mine at solo.ckpool.org
-ck
-ck
Moderator
Legendary
*
Offline Offline

Activity: 1918


Ruu \o/


View Profile WWW
June 14, 2011, 10:57:44 PM
 #443

Also, you can set your frequency governor to ignore niced processes (at least for ondemand and conservative), keeping the CPU speed down when nothing else needs the higher frequency. Works quite well for me.

Ah, didn't know this. Will look into it, thank you!

The toggle you wish to modify is this:
/sys/devices/system/cpu/cpufreq/ondemand/ignore_nice_load

Setting it to 1 will prevent CPUs from ramping up in speed when the workload is running at low priority.
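
For anyone who prefers to script the toggle, here is a minimal sketch. It assumes the ondemand governor is active (the file does not exist otherwise) and that the script runs with enough privileges to write to sysfs; `set_ignore_nice_load` is a hypothetical helper name.

```python
from pathlib import Path

# Path quoted in the post; it only exists while the ondemand governor is in use.
IGNORE_NICE = Path("/sys/devices/system/cpu/cpufreq/ondemand/ignore_nice_load")

def set_ignore_nice_load(enabled, path=IGNORE_NICE):
    """Write 1 or 0 to the toggle and return the value read back."""
    path = Path(path)
    path.write_text("1\n" if enabled else "0\n")
    return path.read_text().strip()
```

Calling `set_ignore_nice_load(True)` as root should leave the file containing `1`, so a niced minerd no longer pushes the CPU to a higher frequency.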

dserrano5
Legendary
*
Offline Offline

Activity: 1568



View Profile
June 15, 2011, 06:09:08 AM
 #444

However you do realise that when it says processor 7, it means processors 0-7 which means you have 8?

Yes. I don't own that machine, and I feel better leaving at least one processor free of load, even if minerd is niced. Thanks for your input :).


/sys/devices/system/cpu/cpufreq/ondemand/ignore_nice_load

Great!!

rocksalt
Jr. Member
*
Offline Offline

Activity: 52



View Profile
June 15, 2011, 08:36:15 AM
 #445

Shameless bump here... I've still been unable to get cpuminer to work on btcguild; no matter what settings I choose, the silly thing still throws the errors. Is anyone using cpuminer on btcguild?

I'm now discovering a different issue :P

minerd.exe --algo cryptopp_asm32 --s 2 --url http://btcguild.com/ --userpass xxxx:xxx

This runs when I try it on deepbit, a local miner and a few others.

However, on btcguild I get the following error:

[2011-06-12 10:02:16] 1 miner threads started, using SHA256 'cryptopp_asm32' algorithm.
[2011-06-12 10:02:20] JSON decode failed(1): '[' or '{' expected near '<'
[2011-06-12 10:02:20] json_rpc_call failed, retry after 30 seconds


It's only happening with btcguild though, not with any of the other mining pools I tested.

Has anyone come across this before?

Win7
Intel Dual Core
Nvidia GTX470OC

TIPS/Donations: mwahahaha.. not that desperate, just a thank you or a flame please but if you must... 1NTZcWQGfdGang9piBKUv9Z1VZ7x6cTXjV
ancow
Sr. Member
****
Offline Offline

Activity: 373


View Profile WWW
June 15, 2011, 11:20:15 AM
 #446

Code:
F:\CPU-miner>cd "F:\CPU-miner"

F:\CPU-miner>minerd.exe --user djinfected --pass dji12406btio --url http://minin
g.bitcoin.cz/ --algo 4way
[2011-06-03 00:00:51] 1 miner threads started, using SHA256 '4way' algorithm.
[2011-06-03 00:00:53] JSON decode failed(1): '[' or '{' expected near '<'
[2011-06-03 00:00:53] json_rpc_call failed, retry after 30 seconds
I don't understand what this means. I get this with the default algo too.
It looks to me like you're getting an HTML response instead of a JSON one. Something to ask your pool admin about (or double-check the URL you're passing, especially if the pool doesn't use the standard port).
Apart from the obvious "this has already been answered here", are you sure you know what you're doing? Setting the scantime to two seconds doesn't seem very prudent to me... (although that setting is probably ignored, assuming your pool supports long polling)

And finally, such questions are better asked in the pool threads.
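
The diagnosis above (an HTML page arriving where JSON was expected) can be checked by looking at the raw response body. A minimal sketch, assuming you have already captured the body as text; `classify_pool_reply` is a hypothetical helper, not part of cpuminer:

```python
import json

def classify_pool_reply(body):
    """Return 'json', 'html', or 'other' for a raw HTTP response body."""
    text = body.lstrip()
    if text.startswith("<"):
        # A leading '<' is what the "expected near '<'" message points at:
        # a web page or error page instead of a JSON-RPC reply.
        return "html"
    try:
        json.loads(text)
    except ValueError:
        return "other"
    return "json"
```

If this reports `html` for the URL you pass to minerd, the URL or port is most likely pointing at the pool's website rather than its RPC endpoint.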

BTC: 1GAHTMdBN4Yw3PU66sAmUBKSXy2qaq2SF4
rocksalt
Jr. Member
*
Offline Offline

Activity: 52



View Profile
June 15, 2011, 11:34:25 AM
 #447


Yeah, I know 2 seconds is quite aggressive. I've tested it all the way up to 10 seconds in 2-second intervals; it hasn't made any difference when I've used it against other pools.

I've tried btcguild.com:8332 and ipaddress:8332... still the HTML response thing. I'll follow that up and see. Interestingly, I'm also getting something similar with bitcoin-miner, so I'm now assuming it's a pool issue and not a miner issue. Thanks for the help though :)
 

jgarzik
Legendary
*
Offline Offline

Activity: 1470


View Profile
June 15, 2011, 06:33:23 PM
 #448

Setting scantime far too low will probably cost you money.  At some point overhead becomes more significant than hashing, as cpuminer is not fully pipelined.

Jeff Garzik, bitcoin core dev team and BitPay engineer; opinions are my own, not my employer.
Donations / tip jar: 1BrufViLKnSWtuWGkryPsKsxonV2NQ7Tcj
d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378



View Profile WWW
June 17, 2011, 04:36:39 AM
 #449

So then, what exactly does scantime do?  It says that my CPU cores are performing around the same computations per sec.  From what I can tell, it only changes how often it tells me how many computations it has computed.  Am I missing something here?  And, if so, what is generally a good value to set this for?  I have it set for about 15 sec and it seems to be working well.

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs.  If you need it, we can get it.  We have solutions for your computing conundrums.  BTC accepted!  12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq
cynikal
Newbie
*
Offline Offline

Activity: 14


View Profile
June 17, 2011, 05:07:40 PM
 #450

Pardon the n00b question, but does cpuminer have any facility to detect when the current block has been solved (so that it can drop what it's doing and begin a new getwork()), or is that what the scantime discussion is addressing?

I'm wondering if cpuminer is (or can be made) intelligent enough not to continue working on the old block and submitting stale shares (I'm thinking of setting up pushpoold if that'd help).
dserrano5
Legendary
*
Offline Offline

Activity: 1568



View Profile
June 17, 2011, 06:11:21 PM
 #451

Yes. That's called "long polling". In cpuminer's output, the lines "LONGPOLL detected new block" tell that a block has been solved.

ancow
Sr. Member
****
Offline Offline

Activity: 373


View Profile WWW
June 17, 2011, 06:28:25 PM
 #452

So then, what exactly does scantime do?  It says that my CPU cores are performing around the same computations per sec.  From what I can tell, it only changes how often it tells me how many computations it has computed.  Am I missing something here?  And, if so, what is generally a good value to set this for?  I have it set for about 15 sec and it seems to be working well.
It determines the amount of time spent on whatever work the server sent you and is only relevant when you're not using long polling. The point behind scantime is that if you find a share for a block that has already been solved, the share is wasted. A server that doesn't support long polling (i.e. telling you when a block is solved) hands you an arbitrarily large amount of work, and the longer you keep working on it, the higher the chance of finding a stale share.

The problem with such low values is that you increase the network load for yourself and the server, as well as the server's overall load, because it has to calculate another workload for you. Basically, with a scantime of 2s you're doing a (really small) DoS attack on the server.
Also, you're spending some of your CPU resources on getting the work, etc., so you're wasting valuable hashing power on overhead.
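
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. It assumes a miner thread asks for new work once per scantime and that each fetch costs a fixed amount of time; the 0.1 s overhead used in the usage note is purely an illustrative guess, not a measured value.

```python
def getwork_requests_per_hour(scantime_s):
    """Requests per hour for one thread without long polling (one per scantime)."""
    return 3600 / scantime_s

def hashing_fraction(scantime_s, overhead_s):
    """Share of wall-clock time left for hashing if each fetch costs overhead_s."""
    return scantime_s / (scantime_s + overhead_s)
```

Under these assumptions a scantime of 2 s means 1800 requests per hour per thread, against 240 at 15 s, and with a hypothetical 0.1 s per fetch the hashing share drops from about 99.3% to about 95.2%.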

d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378



View Profile WWW
June 20, 2011, 06:42:40 AM
 #453

Hey, I was looking through the Ufasoft SSE2_64 code to see if I could make any SSE updates and I'm having difficulty understanding some of it since it's not commented.  I was wondering if you might be able to help me out.  I don't really know the rules for SSE4.1's movntdqa command and seem to have made a boo-boo.  Here's the code I've modified so far and tested compilation for which didn't work.  Could you point out my error?

Code:
LAB_CALC:
movntdqa xmmword ptr [edi], [r11-15*16]
movdqa xmm0, xmmword ptr [edi]
movdqa xmm2, xmmword ptr [edi] ; (Rotr32(w_15, 7) ^ Rotr32(w_15, 18) ^ (w_15 >> 3))
psrld xmm0, 3
movdqa xmm1, xmm0
pslld xmm2, 14
psrld xmm1, 4
pxor xmm0, xmm1
pxor xmm0, xmm2
pslld xmm2, 11
psrld xmm1, 11
pxor xmm0, xmm1
pxor xmm0, xmm2

I think it's because I'm trying to move the value of [r11-15*16] into the cache which is a round-about way of performing an operation on the data which may not be permitted.  Thanks!

I wanted to add that I'm pretty new to coding as well, and the Intel data sheet is clear as mud on specifics and anything that isn't literal.  So I'm sorry if my code has something literal in it that shouldn't be.
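
Since the original code is uncommented, it may help to see why the shift-and-XOR sequence in LAB_CALC works. SSE2 has no packed rotate, so each rotation is split into a right shift and a left shift whose bits never overlap, which lets XOR stand in for OR. The scalar sketch below mirrors the psrld/pslld/pxor steps for a single 32-bit word; the function names are my own.

```python
MASK = 0xFFFFFFFF

def rotr32(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def sigma0_reference(w):
    """Textbook form: Rotr32(w, 7) ^ Rotr32(w, 18) ^ (w >> 3)."""
    return rotr32(w, 7) ^ rotr32(w, 18) ^ (w >> 3)

def sigma0_shifts_only(w):
    """Same value built from the shift/XOR sequence used in LAB_CALC."""
    x0 = w >> 3                # psrld xmm0, 3
    x1 = x0 >> 4               # psrld xmm1, 4   -> w >> 7
    x2 = (w << 14) & MASK      # pslld xmm2, 14
    x0 ^= x1                   # pxor  xmm0, xmm1
    x0 ^= x2                   # pxor  xmm0, xmm2
    x2 = (x2 << 11) & MASK     # pslld xmm2, 11  -> w << 25
    x1 >>= 11                  # psrld xmm1, 11  -> w >> 18
    x0 ^= x1                   # pxor  xmm0, xmm1
    x0 ^= x2                   # pxor  xmm0, xmm2
    return x0
```

Both functions agree for every 32-bit input, because (w >> 7) and (w << 25) occupy disjoint bits, as do (w >> 18) and (w << 14).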

d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378



View Profile WWW
June 20, 2011, 08:46:10 AM
 #454

Well, I think I've realized some of them are going to be 32 bit values instead of 16, but even with the code modified to allow for it, I'm running into problems.  I want to toss as many of the operations into the cache efficiently as I can.  I also realized that I'll need to initialize two 128 bit caches to make room for all 10 xmm values.  I tell you, I'm realizing that 64-bit programming is a brand new ballgame for me.  But I said I would try tossing the code into the buffer and that's what I'm going to do even
;if it kills me fi.
if it takes all night fi.  Tongue  hehe batch

When you start mixing batch and assembly, you know you need sleep.

d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378



View Profile WWW
June 20, 2011, 06:19:24 PM
 #455

Hey guys, it's a major work in progress, but I'm getting a segmentation fault and was wondering if someone could point out the cause.  I'm trying to optimize some of the xmm moves through the edi/esi cache, but I'm a noob at this, so it's more for fun and learning at the moment than anything else.  The code requires SSE4.1 to run correctly.  I have SSE4.1 in case someone asks, so that's not the problem.
I've tried using LFENCE since it's going to be multi-threaded, but I've probably made a noob mistake there too.  Anyhow, here's my non-working code for the moment.  I've made changes to LAB_CALC and to loading the init values into the hash, and stopped part-way through LAB_LOOP.  It's still very much a work in progress, but I expect to see at least some speed-up once I get it working.

Code:
;; SHA-256 for X86-64 for Linux, based off of:

; (c) Ufasoft 2011 http://ufasoft.com mailto:support@ufasoft.com
; Version 2011
; This software is Public Domain

; SHA-256 CPU SSE cruncher for Bitcoin Miner

ALIGN 32
BITS 64

%define hash rdi
%define data rsi
%define init rdx

extern g_4sha256_k

global CalcSha256_x64
; CalcSha256 hash(rdi), data(rsi), init(rdx)
CalcSha256_x64:

push rbx

LAB_NEXT_NONCE:
mov r11, data
; mov rax, pnonce
; mov eax, [rax]
; mov [rbx+3*16], eax
; inc eax
; mov [rbx+3*16+4], eax
; inc eax
; mov [rbx+3*16+8], eax
; inc eax
; mov [rbx+3*16+12], eax

mov rcx, 64*4 ;rcx is # of SHA-2 rounds
mov rax, 16*4 ;rax is where we expand to

LAB_SHA:
push rcx
lea rcx, qword [r11+rcx*4]
lea r11, qword [r11+rax*4]
LAB_CALC:
LFENCE
movdqa xmm0, [r11-15*16]
movdqa [edi], xmm0
; (Rotr32(w_15, 7) ^ Rotr32(w_15, 18) ^ (w_15 >> 3))
psrld xmm0, 3
movdqa [edi+32], xmm0
movntdqa xmm2, [esi]
movntdqa xmm1, [esi+32]
pslld xmm2, 14
psrld xmm1, 4
pxor xmm0, xmm1
pxor xmm0, xmm2
pslld xmm2, 11
psrld xmm1, 11
pxor xmm0, xmm1
pxor xmm0, xmm2

paddd xmm0, [r11-16*16]

movdqa xmm3, [r11-2*16]
movdqa xmm2, xmm3 ; (Rotr32(w_2, 17) ^ Rotr32(w_2, 19) ^ (w_2 >> 10))
psrld xmm3, 10
movdqa xmm1, xmm3
pslld xmm2, 13
psrld xmm1, 7
pxor xmm3, xmm1
pxor xmm3, xmm2
pslld xmm2, 2
psrld xmm1, 2
pxor xmm3, xmm1
pxor xmm3, xmm2
paddd xmm0, xmm3

paddd xmm0, [r11-7*16]
movdqa [r11], xmm0
add r11, 16
cmp r11, rcx
jb LAB_CALC
pop rcx

mov rax, 0

; Load the init values of the message into the hash.

movd xmm0, dword [rdx+4*4] ; xmm0 == e
pshufd  xmm0, xmm0, 0
movdqa [edi], xmm0
movd xmm3, dword [rdx+3*4] ; xmm3 == d
pshufd  xmm3, xmm3, 0
movdqa [edi+32], xmm3
movd xmm4, dword [rdx+2*4] ; xmm4 == c
pshufd  xmm4, xmm4, 0
movdqa [edi+64], xmm4
movd xmm5, dword [rdx+1*4] ; xmm5 == b
pshufd  xmm5, xmm5, 0
movdqa [edi+96], xmm5
movd xmm7, dword [rdx+0*4] ; xmm7 == a
pshufd  xmm7, xmm7, 0
movdqa [edi+112], xmm7
movd xmm8, dword [rdx+5*4] ; xmm8 == f
pshufd  xmm8, xmm8, 0
movdqa [edi+160], xmm8
movd xmm9, dword [rdx+6*4] ; xmm9 == g
pshufd  xmm9, xmm9, 0
movdqa [edi+192], xmm9
movd xmm10, dword [rdx+7*4] ; xmm10 == h
pshufd  xmm10, xmm10, 0
movdqa [edi+224], xmm10

LAB_LOOP:

;; T t1 = h + (Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)) + ((e & f) ^ AndNot(e, g)) + Expand32<T>(g_sha256_k[j]) + w[j]

movdqa xmm6, [rsi+rax*4]
paddd xmm6, g_4sha256_k[rax*4]
add rax, 4

paddd xmm6, xmm10 ; +h

movntdqa xmm1, [esi]
movntdqa xmm2, [esi+192]
pandn xmm1, xmm2 ; ~e & g

movdqa [edi+96], xmm2 ; makes xmm2 the cache location in place of xmm9
movntdqa xmm10, [esi+192] ; h = g
movntdqa xmm2, [esi+160] ; f
movntdqa xmm9, [esi+160] ; g = f

pand xmm2, xmm0 ; e & f
pxor xmm1, xmm2 ; (e & f) ^ (~e & g)
movdqa xmm8, xmm0 ; f = e

paddd xmm6, xmm1 ; Ch + h + w[i] + k[i]

movdqa xmm1, xmm0
psrld xmm0, 6
movdqa xmm2, xmm0
pslld xmm1, 7
psrld xmm2, 5
pxor xmm0, xmm1
pxor xmm0, xmm2
pslld xmm1, 14
psrld xmm2, 14
pxor xmm0, xmm1
pxor xmm0, xmm2
pslld xmm1, 5
pxor xmm0, xmm1 ; Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)
paddd xmm6, xmm0 ; xmm6 = t1

movdqa xmm0, xmm3 ; d
paddd xmm0, xmm6 ; e = d+t1

movdqa xmm1, xmm5 ; =b
movdqa xmm3, xmm4 ; d = c
movdqa xmm2, xmm4 ; c
pand xmm2, xmm5 ; b & c
pand xmm4, xmm7 ; a & c
pand xmm1, xmm7 ; a & b
pxor xmm1, xmm4
movdqa xmm4, xmm5 ; c = b
movdqa xmm5, xmm7 ; b = a
pxor xmm1, xmm2 ; (a & b) ^ (a & c) ^ (b & c)
paddd xmm6, xmm1 ; t1 + ((a & b) ^ (a & c) ^ (b & c))

movdqa xmm2, xmm7
psrld xmm7, 2
movdqa xmm1, xmm7
pslld xmm2, 10
psrld xmm1, 11
pxor xmm7, xmm2
pxor xmm7, xmm1
pslld xmm2, 9
psrld xmm1, 9
pxor xmm7, xmm2
pxor xmm7, xmm1
pslld xmm2, 11
pxor xmm7, xmm2
paddd xmm7, xmm6 ; a = t1 + (Rotr32(a, 2) ^ Rotr32(a, 13) ^ Rotr32(a, 22)) + ((a & b) ^ (a & c) ^ (b & c));

cmp rax, rcx
jb LAB_LOOP

; Finished the 64 rounds, calculate hash and save

movd xmm1, dword [rdx+0*4]
pshufd  xmm1, xmm1, 0
paddd xmm7, xmm1

movd xmm1, dword [rdx+1*4]
pshufd  xmm1, xmm1, 0
paddd xmm5, xmm1

movd xmm1, dword [rdx+2*4]
pshufd  xmm1, xmm1, 0
paddd xmm4, xmm1

movd xmm1, dword [rdx+3*4]
pshufd  xmm1, xmm1, 0
paddd xmm3, xmm1

movd xmm1, dword [rdx+4*4]
pshufd  xmm1, xmm1, 0
paddd xmm0, xmm1

movd xmm1, dword [rdx+5*4]
pshufd  xmm1, xmm1, 0
paddd xmm8, xmm1

movd xmm1, dword [rdx+6*4]
pshufd  xmm1, xmm1, 0
paddd xmm9, xmm1

movd xmm1, dword [rdx+7*4]
pshufd  xmm1, xmm1, 0
paddd xmm10, xmm1

debug_me:
movdqa [rdi+0*16], xmm7
movdqa [rdi+1*16], xmm5
movdqa [rdi+2*16], xmm4
movdqa [rdi+3*16], xmm3
movdqa [rdi+4*16], xmm0
movdqa [rdi+5*16], xmm8
movdqa [rdi+6*16], xmm9
movdqa [rdi+7*16], xmm10

LAB_RET:
pop rbx
ret

Mind you, it does compile so it's not THAT bad anymore.  I figured out that Linux code is much simpler than Windows.
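
When hacking on a core like this, a scalar reference to compare against can save a lot of guessing. The sketch below is a plain single-block SHA-256 that follows the same t1 formula quoted in the LAB_LOOP comment; it is not taken from cpuminer, it handles one lane rather than four, and the helper names are my own. The constants are derived from the first 64 primes instead of being pasted in.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF

def rotr32(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def icbrt(n):
    """Integer cube root (floor) by Newton iteration from an over-estimate."""
    x = 1 << ((n.bit_length() + 2) // 3)
    while True:
        y = (2 * x + n // (x * x)) // 3
        if y >= x:
            return x
        x = y

def first_primes(count):
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

PRIMES = first_primes(64)
# Fractional bits of cube roots (round constants) and square roots (initial state).
K = [icbrt(p << 96) & MASK for p in PRIMES]
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]

def compress(state, block):
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):  # message schedule, the LAB_CALC part
        s0 = rotr32(w[i - 15], 7) ^ rotr32(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr32(w[i - 2], 17) ^ rotr32(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for j in range(64):      # the 64 rounds, the LAB_LOOP part
        t1 = (h + (rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25))
              + ((e & f) ^ (~e & g)) + K[j] + w[j]) & MASK
        t2 = ((rotr32(a, 2) ^ rotr32(a, 13) ^ rotr32(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]

def sha256_single_block(msg):
    """Hash a message short enough (<= 55 bytes) to fit one padded block."""
    assert len(msg) <= 55
    padded = msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))
    return struct.pack(">8I", *compress(H0, padded))

if __name__ == "__main__":
    print(sha256_single_block(b"abc") == hashlib.sha256(b"abc").digest())
```

Dumping the intermediate `w` words or the state after each round from this reference and from the assembly makes it much easier to see at which step a modified core diverges.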

jgarzik
Legendary
*
Offline Offline

Activity: 1470


View Profile
June 20, 2011, 06:38:13 PM
 #456


A user randomly emailed the following sha256 core update:
http://yyz.us/bitcoin/sha256_xmm_amd64_atom.asm

Quote from: Neil_Kettle
Jeff - attached is a somewhat faster sse2_64 core, well, at least for the cpu's I've tested!

An example on an Intel Atom D525 (dual core),

[2011-06-14 14:18:42] 2 miner threads started, using SHA256 'sse2_64' algorithm.
[2011-06-14 14:18:56] thread 0: 16777216 hashes, 1047.98 khash/sec

[2011-06-14 14:18:19] 2 miner threads started, using SHA256 'sse2_64_atom' algorithm.
[2011-06-14 14:18:31] thread 0: 16777216 hashes, 1234.20 khash/sec

It should be faster on all Intel cpu's by quite some margin, up to 20% in my tests.


Anybody want to test this, and prove his assertions?
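
For reference when comparing results, the relative gain implied by the two Atom D525 log lines quoted above can be worked out like this (`speedup_percent` is just an illustrative helper):

```python
def speedup_percent(before_khash, after_khash):
    """Relative improvement of the second rate over the first, in percent."""
    return (after_khash - before_khash) / before_khash * 100.0

if __name__ == "__main__":
    # Figures from the quoted log: sse2_64 vs. sse2_64_atom on thread 0.
    print(round(speedup_percent(1047.98, 1234.20), 1))
```

For those two figures the gain comes out a little under 18%, which is in line with the "up to 20%" claim for that one CPU.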


dserrano5
Legendary
*
Offline Offline

Activity: 1568



View Profile
June 20, 2011, 07:35:11 PM
 #457

Sorry, that's a bit beyond me:

Code:
$ gcc -c sha256_xmm_amd64_atom.asm
gcc: sha256_xmm_amd64_atom.asm: linker input file unused because linking not done

$ yasm !$
yasm sha256_xmm_amd64_atom.asm
sha256_xmm_amd64_atom.asm:26: warning: binary object format does not support extern variables
sha256_xmm_amd64_atom.asm:28: warning: binary object format does not support global variables
sha256_xmm_amd64_atom.asm:216: error: binary object format does not support external references
sha256_xmm_amd64_atom.asm:216: error: binary object format does not support external references
sha256_xmm_amd64_atom.asm:216: error: binary object format does not support external references
sha256_xmm_amd64_atom.asm:216: error: binary object format does not support external references
sha256_xmm_amd64_atom.asm:216: error: binary object format does not support external references
sha256_xmm_amd64_atom.asm:216: error: binary object format does not support external references
sha256_xmm_amd64_atom.asm:216: error: binary object format does not support external references
sha256_xmm_amd64_atom.asm:216: error: binary object format does not support external references

$ gcc sha256_xmm_amd64_atom.asm
/usr/bin/ld:sha256_xmm_amd64_atom.asm: file format not recognized; treating as linker script
/usr/bin/ld:sha256_xmm_amd64_atom.asm:1: syntax error
collect2: ld returned 1 exit status

$ mv sha256_xmm_amd64_atom.asm sha256_xmm_amd64_atom.s
$ gcc sha256_xmm_amd64_atom.s
sha256_xmm_amd64_atom.s: Assembler messages:
sha256_xmm_amd64_atom.s:1: Error: no such instruction: `sha-256 for X86-64 for Linux,based off of:'
sha256_xmm_amd64_atom.s:3: Error: junk at end of line, first unrecognized character is `('
sha256_xmm_amd64_atom.s:4: Error: no such instruction: `version 2011'
[... some screenfuls ...]

$ mv sha256_xmm_amd64_atom.s sha256_xmm_amd64_atom.S
$ gcc sha256_xmm_amd64_atom.S
[identical]
$ as sha256_xmm_amd64_atom.S
[identical]

$ as --version
GNU assembler (GNU Binutils for Ubuntu) 2.21.0.20110327
Copyright 2011 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `x86_64-linux-gnu'.
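
A note on the log above: the file is in NASM/yasm syntax, which gcc and GNU as do not parse, and the "binary object format does not support extern" messages suggest that yasm fell back to its default flat binary output because no object format was given. A sketch of the invocation that would presumably be needed on 64-bit Linux follows; `yasm_command` is a hypothetical helper, and the `elf64` choice is an assumption based on the `x86_64-linux-gnu` target shown in the log.

```python
import shlex

def yasm_command(source, objfmt="elf64"):
    """Build a yasm command line that emits a linkable object file."""
    obj = source.rsplit(".", 1)[0] + ".o"
    return ["yasm", "-f", objfmt, "-o", obj, source]

if __name__ == "__main__":
    print(shlex.join(yasm_command("sha256_xmm_amd64_atom.asm")))
```

The resulting `.o` file still has to be linked into minerd, which the project's own build normally takes care of.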

d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378



View Profile WWW
June 20, 2011, 07:42:51 PM
 #458

Verified.  Since it tried to reference the original asm still, I removed the _atom from the name and the references in the code before I compiled so it would take the place of my SSE2_64 easier.  But yes, I'm seeing a 300 khash/sec increase from around 3400.

Code:
;; SHA-256 for X86-64 for Linux, based off of:

; (c) Ufasoft 2011 http://ufasoft.com mailto:support@ufasoft.com
; Version 2011
; This software is Public Domain

; Significant re-write/optimisation and reordering by,
; Neil Kettle <mu-b@digit-labs.org>
; ~18% performance improvement

; SHA-256 CPU SSE cruncher for Bitcoin Miner

ALIGN 32
BITS 64

%define hash rdi
%define data rsi
%define init rdx

; 0 = (1024 - 256) (mod (LAB_CALC_UNROLL*LAB_CALC_PARA*16))
%define LAB_CALC_PARA 2
%define LAB_CALC_UNROLL 8

%define LAB_LOOP_UNROLL 8

extern g_4sha256_k

global CalcSha256_x64
; CalcSha256 hash(rdi), data(rsi), init(rdx)
CalcSha256_x64:

push rbx

LAB_NEXT_NONCE:

mov rcx, 64*4 ; 256 - rcx is # of SHA-2 rounds
mov rax, 16*4 ; 64 - rax is where we expand to

LAB_SHA:
push rcx
lea rcx, qword [data+rcx*4] ; + 1024
lea r11, qword [data+rax*4] ; + 256

LAB_CALC:
%macro lab_calc_blk 1
movdqa xmm0, [r11-(15-%1)*16] ; xmm0 = W[I-15]
movdqa xmm4, [r11-(15-(%1+1))*16] ; xmm4 = W[I-15+1]
movdqa xmm2, xmm0 ; xmm2 = W[I-15]
movdqa xmm6, xmm4 ; xmm6 = W[I-15+1]
psrld xmm0, 3 ; xmm0 = W[I-15] >> 3
psrld xmm4, 3 ; xmm4 = W[I-15+1] >> 3
movdqa xmm1, xmm0 ; xmm1 = W[I-15] >> 3
movdqa xmm5, xmm4 ; xmm5 = W[I-15+1] >> 3
pslld xmm2, 14 ; xmm2 = W[I-15] << 14
pslld xmm6, 14 ; xmm6 = W[I-15+1] << 14
psrld xmm1, 4 ; xmm1 = W[I-15] >> 7
psrld xmm5, 4 ; xmm5 = W[I-15+1] >> 7
pxor xmm0, xmm1 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7)
pxor xmm4, xmm5 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7)
psrld xmm1, 11 ; xmm1 = W[I-15] >> 18
psrld xmm5, 11 ; xmm5 = W[I-15+1] >> 18
pxor xmm0, xmm2 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14)
pxor xmm4, xmm6 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14)
pslld xmm2, 11 ; xmm2 = W[I-15] << 25
pslld xmm6, 11 ; xmm6 = W[I-15+1] << 25
pxor xmm0, xmm1 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14) ^ (W[I-15] >> 18)
pxor xmm4, xmm5 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14) ^ (W[I-15+1] >> 18)
pxor xmm0, xmm2 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14) ^ (W[I-15] >> 18) ^ (W[I-15] << 25)
pxor xmm4, xmm6 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14) ^ (W[I-15+1] >> 18) ^ (W[I-15+1] << 25)

movdqa xmm3, [r11-(2-%1)*16] ; xmm3 = W[I-2]
movdqa xmm7, [r11-(2-(%1+1))*16] ; xmm7 = W[I-2+1]

paddd xmm0, [r11-(16-%1)*16] ; xmm0 = s0(W[I-15]) + W[I-16]
paddd xmm4, [r11-(16-(%1+1))*16] ; xmm4 = s0(W[I-15+1]) + W[I-16+1]

;;;;;;;;;;;;;;;;;;

movdqa xmm2, xmm3 ; xmm2 = W[I-2]
movdqa xmm6, xmm7 ; xmm6 = W[I-2+1]
psrld xmm3, 10 ; xmm3 = W[I-2] >> 10
psrld xmm7, 10 ; xmm7 = W[I-2+1] >> 10
movdqa xmm1, xmm3 ; xmm1 = W[I-2] >> 10
movdqa xmm5, xmm7 ; xmm5 = W[I-2+1] >> 10

paddd xmm0, [r11-(7-%1)*16] ; xmm0 = s0(W[I-15]) + W[I-16] + W[I-7]

pslld xmm2, 13 ; xmm2 = W[I-2] << 13
pslld xmm6, 13 ; xmm6 = W[I-2+1] << 13
psrld xmm1, 7 ; xmm1 = W[I-2] >> 17
psrld xmm5, 7 ; xmm5 = W[I-2+1] >> 17

paddd xmm4, [r11-(7-(%1+1))*16] ; xmm4 = s0(W[I-15+1]) + W[I-16+1] + W[I-7+1]

pxor xmm3, xmm1 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17)
pxor xmm7, xmm5 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17)
psrld xmm1, 2 ; xmm1 = W[I-2] >> 19
psrld xmm5, 2 ; xmm5 = W[I-2+1] >> 19
pxor xmm3, xmm2 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13)
pxor xmm7, xmm6 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13)
pslld xmm2, 2 ; xmm2 = W[I-2] << 15
pslld xmm6, 2 ; xmm6 = W[I-2+1] << 15
pxor xmm3, xmm1 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13) ^ (W[I-2] >> 19)
pxor xmm7, xmm5 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13) ^ (W[I-2+1] >> 19)
pxor xmm3, xmm2 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13) ^ (W[I-2] >> 19) ^ (W[I-2] << 15)
pxor xmm7, xmm6 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13) ^ (W[I-2+1] >> 19) ^ (W[I-2+1] << 15)

paddd xmm0, xmm3 ; xmm0 = s0(W[I-15]) + W[I-16] + s1(W[I-2]) + W[I-7]
paddd xmm4, xmm7 ; xmm4 = s0(W[I-15+1]) + W[I-16+1] + s1(W[I-2+1]) + W[I-7+1]
movdqa [r11+(%1*16)], xmm0
movdqa [r11+((%1+1)*16)], xmm4
%endmacro

%assign i 0
%rep    LAB_CALC_UNROLL
        lab_calc_blk i
%assign i i+LAB_CALC_PARA
%endrep

add r11, LAB_CALC_UNROLL*LAB_CALC_PARA*16
cmp r11, rcx
jb LAB_CALC

pop rcx
mov rax, 0

; Load the init values of the message into the hash.

movdqa xmm7, [init]
pshufd xmm5, xmm7, 0x55 ; xmm5 == b
pshufd xmm4, xmm7, 0xAA ; xmm4 == c
pshufd xmm3, xmm7, 0xFF ; xmm3 == d
pshufd xmm7, xmm7, 0 ; xmm7 == a

movdqa xmm0, [init+4*4]
pshufd xmm8, xmm0, 0x55 ; xmm8 == f
pshufd xmm9, xmm0, 0xAA ; xmm9 == g
pshufd xmm10, xmm0, 0xFF ; xmm10 == h
pshufd xmm0, xmm0, 0 ; xmm0 == e

LAB_LOOP:

;; T t1 = h + (Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)) + ((e & f) ^ AndNot(e, g)) + Expand32<T>(g_sha256_k[j]) + w[j]

%macro lab_loop_blk 0
movdqa xmm6, [data+rax*4]
paddd xmm6, g_4sha256_k[rax*4]
add rax, 4

paddd xmm6, xmm10 ; +h

movdqa xmm1, xmm0
movdqa xmm2, xmm9
pandn xmm1, xmm2 ; ~e & g

movdqa xmm10, xmm2 ; h = g
movdqa xmm2, xmm8 ; f
movdqa xmm9, xmm2 ; g = f

pand xmm2, xmm0 ; e & f
pxor xmm1, xmm2 ; (e & f) ^ (~e & g)
movdqa xmm8, xmm0 ; f = e

paddd xmm6, xmm1 ; Ch + h + w[i] + k[i]

movdqa xmm1, xmm0
psrld xmm0, 6
movdqa xmm2, xmm0
pslld xmm1, 7
psrld xmm2, 5
pxor xmm0, xmm1
pxor xmm0, xmm2
pslld xmm1, 14
psrld xmm2, 14
pxor xmm0, xmm1
pxor xmm0, xmm2
pslld xmm1, 5
pxor xmm0, xmm1 ; Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)
paddd xmm6, xmm0 ; xmm6 = t1

movdqa xmm0, xmm3 ; d
paddd xmm0, xmm6 ; e = d+t1

movdqa xmm1, xmm5 ; =b
movdqa xmm3, xmm4 ; d = c
movdqa xmm2, xmm4 ; c
pand xmm2, xmm5 ; b & c
pand xmm4, xmm7 ; a & c
pand xmm1, xmm7 ; a & b
pxor xmm1, xmm4
movdqa xmm4, xmm5 ; c = b
movdqa xmm5, xmm7 ; b = a
pxor xmm1, xmm2 ; (a & b) ^ (a & c) ^ (b & c)
paddd xmm6, xmm1 ; t1 + ((a & b) ^ (a & c) ^ (b & c))

movdqa xmm2, xmm7
psrld xmm7, 2
movdqa xmm1, xmm7
pslld xmm2, 10
psrld xmm1, 11
pxor xmm7, xmm2
pxor xmm7, xmm1
pslld xmm2, 9
psrld xmm1, 9
pxor xmm7, xmm2
pxor xmm7, xmm1
pslld xmm2, 11
pxor xmm7, xmm2
paddd xmm7, xmm6 ; a = t1 + (Rotr32(a, 2) ^ Rotr32(a, 13) ^ Rotr32(a, 22)) + ((a & b) ^ (a & c) ^ (b & c));
%endmacro

%assign i 0
%rep    LAB_LOOP_UNROLL
        lab_loop_blk
%assign i i+1
%endrep

cmp rax, rcx
jb LAB_LOOP

; Finished the 64 rounds, calculate hash and save

movdqa xmm1, [rdx]
pshufd xmm2, xmm1, 0x55
pshufd xmm6, xmm1, 0xAA
pshufd xmm11, xmm1, 0xFF
pshufd xmm1, xmm1, 0

paddd xmm5, xmm2
paddd xmm4, xmm6
paddd xmm3, xmm11
paddd xmm7, xmm1

movdqa xmm1, [rdx+4*4]
pshufd xmm2, xmm1, 0x55
pshufd xmm6, xmm1, 0xAA
pshufd xmm11, xmm1, 0xFF
pshufd xmm1, xmm1, 0

paddd xmm8, xmm2
paddd xmm9, xmm6
paddd xmm10, xmm11
paddd xmm0, xmm1

movdqa [hash+0*16], xmm7
movdqa [hash+1*16], xmm5
movdqa [hash+2*16], xmm4
movdqa [hash+3*16], xmm3
movdqa [hash+4*16], xmm0
movdqa [hash+5*16], xmm8
movdqa [hash+6*16], xmm9
movdqa [hash+7*16], xmm10

LAB_RET:
pop rbx
ret

I notice that it doesn't rely as heavily on moving the quad-words around from xmm to xmm.  But if there's some way of moving some of those into the processor cache, as I was trying to do, I think they can still be write combined which would speed up hashing just a smidge more.  But, again, I'm still a noob at these more recent coding techniques.
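
One of the visible changes in the new core is how the init values are loaded: a single 16-byte load followed by pshufd with the immediates 0, 0x55, 0xAA and 0xFF replaces the eight movd/pshufd pairs. A small model of pshufd on four 32-bit lanes shows why those immediates broadcast one lane each (the list-based representation is my own simplification):

```python
def pshufd(src, imm8):
    """Model of pshufd: result lane i takes src[(imm8 >> (2 * i)) & 3]."""
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]

if __name__ == "__main__":
    init = ["a", "b", "c", "d"]   # the four dwords loaded by movdqa xmm7, [init]
    for imm in (0x00, 0x55, 0xAA, 0xFF):
        print(hex(imm), pshufd(init, imm))
```

Each two-bit field of the immediate selects a source lane, so 0x55 (binary 01 01 01 01) copies lane 1 into all four positions, 0xAA copies lane 2 and 0xFF copies lane 3.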

d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378



View Profile WWW
June 20, 2011, 08:17:10 PM
 #459

Tested on another machine and I'm seeing an increase from about 1500 to around 1750.

LehmanSister
Member
**
Offline Offline

Activity: 68


High Desert Dweller-Where Space and Time Meet $


View Profile
June 21, 2011, 09:33:19 AM
 #460

I see the 20% boost on Atoms for sure.

Code:
git clone git://github.com/jgarzik/cpuminer.git
wget -O cpuminer/x86_64/sha256_xmm_amd64_atom.asm http://yyz.us/bitcoin/sha256_xmm_amd64_atom.asm
cd cpuminer
./automake.sh
./configure
make all

Note: yasm 1.0 isn't in debian stable yet.

[Edit: Ooops, I was doing quite a few things, I think I did the "_atom" strip as well]


ISO: small island nations with large native populations excited to pay tribute to flying gods, will trade BTC.