-ck | Legendary | Activity: 4606 | Merit: 1701
June 14, 2011, 10:57:44 PM

>> Also, you can set your frequency governor to ignore niced processes (at least for ondemand and conservative), keeping the CPU speed down when nothing else needs the higher frequency. Works quite well for me.
>
> Ah, didn't know this. Will look into it, thank you!

The toggle you wish to modify is this:

/sys/devices/system/cpu/cpufreq/ondemand/ignore_nice_load

Setting it to 1 will prevent CPUs from ramping up in speed when the workload is running at low priority.
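A minimal C sketch of flipping that toggle programmatically, assuming the ondemand governor is active and the tunable lives at the path given above (the location can differ between kernel versions and governors). `write_governor_flag` and `read_governor_flag` are hypothetical helper names, and writing the real sysfs file requires root.

```c
#include <stdio.h>

/* Write 0 or 1 to a cpufreq governor tunable such as
 * /sys/devices/system/cpu/cpufreq/ondemand/ignore_nice_load.
 * Returns 0 on success, -1 if the file can't be opened or written
 * (wrong governor, different kernel layout, or not running as root). */
int write_governor_flag(const char *path, int enabled)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fprintf(f, "%d\n", enabled ? 1 : 0) > 0;
    /* sysfs may only report a rejected value when the buffer is flushed */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the tunable back; returns its integer value, or -1 on error. */
int read_governor_flag(const char *path)
{
    FILE *f = fopen(path, "r");
    int value = -1;
    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Run as root with the path above and `enabled = 1`, this is equivalent to writing 1 into the file from a shell; a niced miner should then no longer push the CPUs to a higher frequency on its own.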

--
Developer/maintainer for cgminer, ckpool/ckproxy, and the -ck kernel
2% Fee Solo mining at solo.ckpool.org
-ck

dserrano5 | Legendary | Activity: 1974 | Merit: 1038
June 15, 2011, 06:09:08 AM (last edit: June 15, 2011, 06:23:14 AM by an0therlr3)

> However you do realise that when it says processor 7, it means processors 0-7, which means you have 8?

Yes. I don't own that machine, and I feel better leaving at least one processor free of load, even if minerd is niced. Thanks for your input.

> /sys/devices/system/cpu/cpufreq/ondemand/ignore_nice_load

Great!!
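The CPU arithmetic from this exchange as a small C sketch, assuming a Linux or BSD system where `sysconf(_SC_NPROCESSORS_ONLN)` is available (a widespread extension rather than strict POSIX). `online_cpus` and `mining_threads` are illustrative names; the result would be passed to the miner's thread-count option.

```c
#include <unistd.h>

/* Number of online logical CPUs. /proc/cpuinfo numbers them from 0,
 * so a last entry of "processor : 7" means 8 CPUs. */
long online_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;               /* fall back to 1 if unknown */
}

/* Leave one CPU free for the machine's owner, but always run at
 * least one mining thread. */
long mining_threads(long cpus)
{
    return cpus > 1 ? cpus - 1 : 1;
}
```

On the 8-CPU machine discussed above, `mining_threads(online_cpus())` gives 7; running the miner niced on top of that keeps it from competing with the owner's own work.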

rocksalt | Newbie | Activity: 51 | Merit: 0
June 15, 2011, 08:36:15 AM

Shameless bump here... I've still been unable to get cpuminer to work on btcguild. No matter what settings I choose, the silly thing still throws the errors... Is anyone using cpuminer on btcguild?

I'm now discovering a different issue:

minerd.exe --algo cryptopp_asm32 --s 2 --url http://btcguild.com/ --userpass xxxx:xxx

This runs when I try it on deepbit, a local miner and a few others... however, on btcguild I get the following error:

[2011-06-12 10:02:16] 1 miner threads started, using SHA256 'cryptopp_asm32' algorithm.
[2011-06-12 10:02:20] JSON decode failed(1): '[' or '{' expected near '<'
[2011-06-12 10:02:20] json_rpc_call failed, retry after 30 seconds

It's only happening with btcguild though, not with any of the other mining pools I tested. Has anyone come across this before?

Win7, Intel Dual Core, Nvidia GTX470OC

ancow
June 15, 2011, 11:20:15 AM

> Shameless bump here... I've still been unable to get cpuminer to work on btcguild. [...] however, on btcguild I get the following error:
>
> [2011-06-12 10:02:20] JSON decode failed(1): '[' or '{' expected near '<'
> [2011-06-12 10:02:20] json_rpc_call failed, retry after 30 seconds

>> F:\CPU-miner>cd "F:\CPU-miner"
>> F:\CPU-miner>minerd.exe --user djinfected --pass dji12406btio --url http://mining.bitcoin.cz/ --algo 4way
>> [2011-06-03 00:00:51] 1 miner threads started, using SHA256 '4way' algorithm.
>> [2011-06-03 00:00:53] JSON decode failed(1): '[' or '{' expected near '<'
>> [2011-06-03 00:00:53] json_rpc_call failed, retry after 30 seconds
>>
>> I don't understand what this means. I get this with the default algo too.
>
> It looks to me like you're getting an HTML response instead of a JSON one. Something to ask your pool admin about (or double-check the URL you're passing, especially if the pool doesn't use the standard port).

Apart from the obvious "this has already been answered here", are you sure you know what you're doing? Setting the scantime to two seconds doesn't seem very prudent to me (although that setting is probably ignored, assuming your pool supports long polling). And finally, such questions are better asked in the pool threads.
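To make the diagnosis concrete: the JSON parser is complaining that the reply body starts with `<`, which is what an HTML page (a web front end, error page or login page) looks like. A minimal C sketch of that check; `classify_body` is a hypothetical helper, not part of cpuminer.

```c
#include <ctype.h>
#include <stddef.h>

enum body_kind { BODY_EMPTY, BODY_JSON, BODY_HTML, BODY_OTHER };

/* Classify an HTTP response body by its first non-whitespace byte.
 * A JSON-RPC reply starts with '{' (or '[' for a batch); a body that
 * starts with '<' is almost certainly HTML, i.e. the URL or port points
 * at a web page rather than the pool's RPC endpoint. */
enum body_kind classify_body(const char *body, size_t len)
{
    size_t i = 0;
    while (i < len && isspace((unsigned char)body[i]))
        i++;
    if (i == len)
        return BODY_EMPTY;
    switch (body[i]) {
    case '{':
    case '[':
        return BODY_JSON;
    case '<':
        return BODY_HTML;
    default:
        return BODY_OTHER;
    }
}
```

A body classified as `BODY_HTML` points at the URL side (host, path or port) rather than at the miner, which matches the advice quoted above.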

--
BTC: 1GAHTMdBN4Yw3PU66sAmUBKSXy2qaq2SF4

rocksalt | Newbie | Activity: 51 | Merit: 0
June 15, 2011, 11:34:25 AM

> Apart from the obvious "this has already been answered here", are you sure you know what you're doing? Setting the scantime to two seconds doesn't seem very prudent to me (although that setting is probably ignored, assuming your pool supports long polling). And finally, such questions are better asked in the pool threads.

Yeah, I know 2 seconds is quite aggressive. I've tested it all the way up to 10 seconds in 2-second steps, and it hasn't made any difference when I've used it against other pools. I've tried btcguild.com:8332 and ipaddress:8332... still the HTML response thing. I'll follow that up and see. Interestingly, I'm also getting something similar with bitcoin-miner, so I'm now assuming it's a pool issue and not a miner issue. Thanks for the help though.

jgarzik (OP) | Legendary | Activity: 1596 | Merit: 1145
June 15, 2011, 06:33:23 PM

Setting scantime far too low will probably cost you money. At some point overhead becomes more significant than hashing, as cpuminer is not fully pipelined.
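A back-of-the-envelope model of that trade-off, assuming (for illustration only) that each work fetch stalls a thread for a fixed time during which it does no hashing. The 0.25 s stall used below is a made-up figure, not a measurement of cpuminer or any pool; `hashing_fraction` and `effective_khash` are hypothetical names.

```c
/* Fraction of wall-clock time spent hashing if a thread hashes for
 * scantime_s seconds and then stalls fetch_s seconds to get new work
 * (model assumption: no overlap between fetching and hashing). */
double hashing_fraction(double scantime_s, double fetch_s)
{
    return scantime_s / (scantime_s + fetch_s);
}

/* Effective rate once the stall overhead is taken into account. */
double effective_khash(double raw_khash, double scantime_s, double fetch_s)
{
    return raw_khash * hashing_fraction(scantime_s, fetch_s);
}
```

With a 0.25 s stall, a 2 s scantime leaves about 89% of the time for hashing (2 / 2.25), while 60 s leaves about 99.6% (60 / 60.25).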

--
Jeff Garzik, Bloq CEO, former bitcoin core dev team; opinions are my own.
Visit bloq.com / metronome.io
Donations / tip jar: 1BrufViLKnSWtuWGkryPsKsxonV2NQ7Tcj

d3m0n1q_733rz
June 17, 2011, 04:36:39 AM

So then, what exactly does scantime do? It says that my CPU cores are performing around the same computations per sec. From what I can tell, it only changes how often it tells me how many computations it has computed. Am I missing something here? And, if so, what is generally a good value to set this for? I have it set for about 15 sec and it seems to be working well.

--
Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs. If you need it, we can get it. We have solutions for your computing conundrums. BTC accepted! 12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq

cynikal | Newbie | Activity: 14 | Merit: 0
June 17, 2011, 05:07:40 PM

Pardon the n00b question, but does cpuminer have any facility to detect when the current block's been solved (so that it can drop what it's doing and begin a new getwork()), or is that what the scantime discussion is addressing?

I'm wondering if cpuminer is (or can somehow be made) intelligent enough to stop working on the old block and submitting stale shares. (I'm thinking of setting up pushpoold if that'd help.)

dserrano5 | Legendary | Activity: 1974 | Merit: 1038
June 17, 2011, 06:11:21 PM

Yes, that's called "long polling". In cpuminer's output, the "LONGPOLL detected new block" lines tell you that a block has been solved.
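A simplified C11 sketch of how a long-poll notification can cut a scan short: a long-poll thread (not shown) sets a shared flag when the pool announces a new block, and the hashing loop checks that flag every few thousand nonces and abandons the stale work. `scan_range`, `toy_is_share` and `CHECK_INTERVAL` are illustrative names, not cpuminer's actual code, and `toy_is_share` stands in for the real SHA-256 target check.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CHECK_INTERVAL 4096u   /* how often (in nonces) to look at the flag */

typedef bool (*share_check_fn)(uint32_t nonce, void *ctx);

/* Toy stand-in for "hash this nonce and compare against the target":
 * treats the single nonce stored in ctx as the winning one. */
bool toy_is_share(uint32_t nonce, void *ctx)
{
    return nonce == *(const uint32_t *)ctx;
}

/* Try nonces first..last until a share is found, the range is exhausted,
 * or *restart is set by the long-poll thread. Returns the number of
 * nonces tried; *found is written only when a share is found. */
uint64_t scan_range(uint32_t first, uint32_t last, atomic_bool *restart,
                    share_check_fn is_share, void *ctx, uint32_t *found)
{
    uint64_t tried = 0;
    for (uint64_t n = first; n <= last; n++) {
        if (tried % CHECK_INTERVAL == 0 && atomic_load(restart))
            break;                  /* new block announced: work is stale */
        tried++;
        if (is_share((uint32_t)n, ctx)) {
            *found = (uint32_t)n;
            break;
        }
    }
    return tried;
}
```

Without long polling there is nobody to set the flag, so the loop only stops when the scantime deadline forces a new getwork, which is the case the scantime discussion is about.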

ancow
June 17, 2011, 06:28:25 PM

> So then, what exactly does scantime do? It says that my CPU cores are performing around the same computations per sec. From what I can tell, it only changes how often it tells me how many computations it has computed. Am I missing something here? And, if so, what is generally a good value to set this for? I have it set for about 15 sec and it seems to be working well.

It determines how long you keep working on whatever work the server sent you, and it's only relevant when you're not using long polling. The point behind scantime is that if you find a share for a block that has already been solved, the share is wasted. So with a server that doesn't support long polling (i.e. telling you when a block is solved), the work you get could keep you busy for an arbitrarily long time, and the longer you spend on it, the higher the chance of finding a stale share. The problem with very low values is that you increase the network load for yourself and the server, as well as the server's overall load, because it has to generate new work for you each time. Basically, with a scantime of 2 s you're doing a (really small) DoS attack on the server. Also, you're spending some of your CPU resources on getting the work, etc., so you're wasting valuable hashing power on overhead.
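The staleness side of that trade-off can be estimated with a rough model, assuming no long polling, blocks found at random times relative to your work refreshes, and an average block interval of about 600 seconds (Bitcoin's target). Under those assumptions you keep hashing already-solved work for scantime/2 seconds per block on average. `stale_fraction` is a hypothetical helper for illustration, not how any pool accounts for stale shares.

```c
/* Expected fraction of hashing spent on work whose block has already
 * been solved, when work is only refreshed every scantime_s seconds.
 * Model assumptions: no long polling, block times uniformly random
 * relative to the refresh cycle, scantime_s much smaller than the
 * block interval, network and pool latency ignored. */
double stale_fraction(double scantime_s, double block_interval_s)
{
    double expected_stale_window = scantime_s / 2.0;
    return expected_stale_window / block_interval_s;
}
```

For a 600 s block interval this gives roughly 0.17% for a 2 s scantime, 1.25% for 15 s and 5% for 60 s.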

d3m0n1q_733rz
June 20, 2011, 06:42:40 AM

Hey, I was looking through the Ufasoft SSE2_64 code to see if I could make any SSE updates, and I'm having difficulty understanding some of it since it's not commented. I was wondering if you might be able to help me out. I don't really know the rules for SSE4.1's movntdqa instruction and seem to have made a boo-boo. Here's the code I've modified so far; it failed to compile when I tested it. Could you point out my error?

LAB_CALC:
	movntdqa xmmword ptr [edi], [r11-15*16]
	movdqa xmm0, xmmword ptr [edi]
	movdqa	xmm2, xmmword ptr [edi]		; (Rotr32(w_15, 7) ^ Rotr32(w_15, 18) ^ (w_15 >> 3))

	psrld	xmm0, 3
	movdqa	xmm1, xmm0
	pslld	xmm2, 14
	psrld	xmm1, 4
	pxor	xmm0, xmm1
	pxor	xmm0, xmm2
	pslld	xmm2, 11
	psrld	xmm1, 11
	pxor	xmm0, xmm1
	pxor	xmm0, xmm2

I think it's because I'm trying to move the value of [r11-15*16] into the cache, which is a roundabout way of performing an operation on the data and may not be permitted. Thanks!

I wanted to add that I'm pretty new to coding as well, and the Intel datasheet is clear as mud on specifics and anything that isn't literal. So I'm sorry if my code has something literal in it that shouldn't be.
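SSE2 has no packed 32-bit rotate, which is why the Ufasoft code builds Rotr32(w,7) ^ Rotr32(w,18) ^ (w >> 3) out of psrld, pslld and pxor, reusing each partial shift for the next one. Below is a scalar C sketch that mirrors that instruction sequence for one 32-bit lane and can be checked against the rotate-based definition; the function names are illustrative. (As an aside, SSE instructions don't accept two memory operands, and MOVNTDQA only exists as a load from 16-byte-aligned memory into an XMM register.)

```c
#include <stdint.h>

/* Reference definition: 32-bit rotate right, valid for 0 < n < 32. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* SHA-256 message-schedule function sigma0, as written in the comment. */
uint32_t sigma0_reference(uint32_t w)
{
    return rotr32(w, 7) ^ rotr32(w, 18) ^ (w >> 3);
}

/* Same value, built the way the SSE2 code does it. Each rotate splits
 * into a right shift and a left shift whose bits never overlap, so OR
 * and XOR agree and all five pieces can simply be XORed together. */
uint32_t sigma0_shifts(uint32_t w)
{
    uint32_t x0 = w >> 3;     /* psrld xmm0, 3                 */
    uint32_t x1 = x0;         /* movdqa xmm1, xmm0             */
    uint32_t x2 = w << 14;    /* pslld xmm2, 14                */
    x1 >>= 4;                 /* psrld xmm1, 4   -> w >> 7     */
    x0 ^= x1;                 /* pxor xmm0, xmm1               */
    x0 ^= x2;                 /* pxor xmm0, xmm2               */
    x2 <<= 11;                /* pslld xmm2, 11  -> w << 25    */
    x1 >>= 11;                /* psrld xmm1, 11  -> w >> 18    */
    x0 ^= x1;                 /* pxor xmm0, xmm1               */
    x0 ^= x2;                 /* pxor xmm0, xmm2               */
    return x0;
}
```

Keeping a scalar check like this next to a hand-edited SIMD routine makes it easy to confirm that a reordered shift sequence still computes the same function.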

d3m0n1q_733rz
June 20, 2011, 08:46:10 AM

Well, I think I've realized some of them are going to be 32-bit values instead of 16, but even with the code modified to allow for it, I'm running into problems. I want to move as many of the operations into the cache as efficiently as I can. I also realized that I'll need to initialize two 128-bit caches to make room for all 10 xmm values. I tell you, I'm realizing that 64-bit programming is a brand new ballgame for me. But I said I would try tossing the code into the buffer, and that's what I'm going to do, even ;if it kills me fi. if it takes all night fi. Hehe, batch. When you start mixing batch and assembly, you know you need sleep.

d3m0n1q_733rz
June 20, 2011, 06:19:24 PM

Hey guys, it's a major work in progress, but I'm getting a segmentation fault and was wondering if someone could point out the cause. I'm trying to optimize some of the xmm moves through the edi/esi cache, but I'm a noob at this, so it's more for fun and learning at the moment than anything else. The code requires SSE4.1 to run correctly. I have SSE4.1, in case someone asks, so that's not the problem. I've tried using LFENCE since it's going to be multi-threaded, but I've probably made a noob mistake there too. Anyhow, here's my non-working code for the moment. I've made changes to LAB_CALC and to loading the init values into the hash, and stopped part-way through LAB_LOOP. It's still very much a work in progress, but I expect to see at least some speed-up once I get it working.

;; SHA-256 for X86-64 for Linux, based off of:
; (c) Ufasoft 2011 http://ufasoft.com mailto:support@ufasoft.com
; Version 2011
; This software is Public Domain

; SHA-256 CPU SSE cruncher for Bitcoin Miner

ALIGN 32
BITS 64

%define hash rdi
%define data rsi
%define init rdx

extern g_4sha256_k

global CalcSha256_x64
;	CalcSha256	hash(rdi), data(rsi), init(rdx)
CalcSha256_x64:

	push	rbx

LAB_NEXT_NONCE:
	mov	r11, data
;	mov	rax, pnonce
;	mov	eax, [rax]
;	mov	[rbx+3*16], eax
;	inc	eax
;	mov	[rbx+3*16+4], eax
;	inc	eax
;	mov	[rbx+3*16+8], eax
;	inc	eax
;	mov	[rbx+3*16+12], eax

	mov	rcx, 64*4	;rcx is # of SHA-2 rounds
	mov	rax, 16*4	;rax is where we expand to

LAB_SHA:
	push	rcx
	lea	rcx, qword [r11+rcx*4]
	lea	r11, qword [r11+rax*4]
LAB_CALC:
	LFENCE
	movdqa	xmm0, [r11-15*16]
	movdqa	[edi], xmm0		; (Rotr32(w_15, 7) ^ Rotr32(w_15, 18) ^ (w_15 >> 3))
	psrld	xmm0, 3
	movdqa	[edi+32], xmm0
	movntdqa	xmm2, [esi]
	movntdqa	xmm1, [esi+32]
	pslld	xmm2, 14
	psrld	xmm1, 4
	pxor	xmm0, xmm1
	pxor	xmm0, xmm2
	pslld	xmm2, 11
	psrld	xmm1, 11
	pxor	xmm0, xmm1
	pxor	xmm0, xmm2

	paddd	xmm0, [r11-16*16]

	movdqa	xmm3, [r11-2*16]
	movdqa	xmm2, xmm3		; (Rotr32(w_2, 17) ^ Rotr32(w_2, 19) ^ (w_2 >> 10))
	psrld	xmm3, 10
	movdqa	xmm1, xmm3
	pslld	xmm2, 13
	psrld	xmm1, 7
	pxor	xmm3, xmm1
	pxor	xmm3, xmm2
	pslld	xmm2, 2
	psrld	xmm1, 2
	pxor	xmm3, xmm1
	pxor	xmm3, xmm2
	paddd	xmm0, xmm3

	paddd	xmm0, [r11-7*16]
	movdqa	[r11], xmm0
	add	r11, 16
	cmp	r11, rcx
	jb	LAB_CALC
	pop	rcx

	mov rax, 0

; Load the init values of the message into the hash.

	movd	xmm0, dword [rdx+4*4]		; xmm0 == e
	pshufd  xmm0, xmm0, 0
	movdqa	[edi], xmm0
	movd	xmm3, dword [rdx+3*4]		; xmm3 == d
	pshufd  xmm3, xmm3, 0
	movdqa [edi+32], xmm3
	movd	xmm4, dword [rdx+2*4]		; xmm4 == c
	pshufd  xmm4, xmm4, 0
	movdqa	[edi+64], xmm4
	movd	xmm5, dword [rdx+1*4]		; xmm5 == b
	pshufd  xmm5, xmm5, 0
	movdqa	[edi+96], xmm5
	movd	xmm7, dword [rdx+0*4]		; xmm7 == a
	pshufd  xmm7, xmm7, 0
	movdqa	[edi+112], xmm7
	movd	xmm8, dword [rdx+5*4]		; xmm8 == f
	pshufd  xmm8, xmm8, 0
	movdqa	[edi+160], xmm8
	movd	xmm9, dword [rdx+6*4]		; xmm9 == g
	pshufd  xmm9, xmm9, 0
	movdqa	[edi+192], xmm9
	movd	xmm10, dword [rdx+7*4]		; xmm10 == h
	pshufd  xmm10, xmm10, 0
	movdqa	[edi+224], xmm10

LAB_LOOP:

;; T t1 = h + (Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)) + ((e & f) ^ AndNot(e, g)) + Expand32<T>(g_sha256_k[j]) + w[j]

	movdqa	xmm6, [rsi+rax*4]
	paddd	xmm6, g_4sha256_k[rax*4]
	add	rax, 4

	paddd	xmm6, xmm10	; +h

	movntdqa	xmm1, [esi]
	movntdqa	xmm2, [esi+192]
	pandn	xmm1, xmm2	; ~e & g

	movdqa	[edi+96], xmm2	; makes xmm2 the cache location in place of xmm9
	movntdqa	xmm10, [esi+192]	; h = g
	movntdqa	xmm2, [esi+160]	; f
	movntdqa	xmm9, [esi+160]	; g = f

	pand	xmm2, xmm0	; e & f
	pxor	xmm1, xmm2	; (e & f) ^ (~e & g)
	movdqa	xmm8, xmm0	; f = e

	paddd	xmm6, xmm1	; Ch + h + w[i] + k[i]

	movdqa	xmm1, xmm0
	psrld	xmm0, 6
	movdqa	xmm2, xmm0
	pslld	xmm1, 7
	psrld	xmm2, 5
	pxor	xmm0, xmm1
	pxor	xmm0, xmm2
	pslld	xmm1, 14
	psrld	xmm2, 14
	pxor	xmm0, xmm1
	pxor	xmm0, xmm2
	pslld	xmm1, 5
	pxor	xmm0, xmm1	; Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)
	paddd	xmm6, xmm0	; xmm6 = t1

	movdqa	xmm0, xmm3	; d
	paddd	xmm0, xmm6	; e = d+t1

	movdqa	xmm1, xmm5	; =b
	movdqa	xmm3, xmm4	; d = c
	movdqa	xmm2, xmm4	; c
	pand	xmm2, xmm5	; b & c
	pand	xmm4, xmm7	; a & c
	pand	xmm1, xmm7	; a & b
	pxor	xmm1, xmm4
	movdqa	xmm4, xmm5	; c = b
	movdqa	xmm5, xmm7	; b = a
	pxor	xmm1, xmm2	; (a & c) ^ (a & d) ^ (c & d)
	paddd	xmm6, xmm1	; t1 + ((a & c) ^ (a & d) ^ (c & d))

	movdqa	xmm2, xmm7
	psrld	xmm7, 2
	movdqa	xmm1, xmm7
	pslld	xmm2, 10
	psrld	xmm1, 11
	pxor	xmm7, xmm2
	pxor	xmm7, xmm1
	pslld	xmm2, 9
	psrld	xmm1, 9
	pxor	xmm7, xmm2
	pxor	xmm7, xmm1
	pslld	xmm2, 11
	pxor	xmm7, xmm2
	paddd	xmm7, xmm6	; a = t1 + (Rotr32(a, 2) ^ Rotr32(a, 13) ^ Rotr32(a, 22)) + ((a & c) ^ (a & d) ^ (c & d));

	cmp	rax, rcx
	jb	LAB_LOOP

; Finished the 64 rounds, calculate hash and save

	movd	xmm1, dword [rdx+0*4]
	pshufd  xmm1, xmm1, 0
	paddd	xmm7, xmm1

	movd	xmm1, dword [rdx+1*4]
	pshufd  xmm1, xmm1, 0
	paddd	xmm5, xmm1

	movd	xmm1, dword [rdx+2*4]
	pshufd  xmm1, xmm1, 0
	paddd	xmm4, xmm1

	movd	xmm1, dword [rdx+3*4]
	pshufd  xmm1, xmm1, 0
	paddd	xmm3, xmm1

	movd	xmm1, dword [rdx+4*4]
	pshufd  xmm1, xmm1, 0
	paddd	xmm0, xmm1

	movd	xmm1, dword [rdx+5*4]
	pshufd  xmm1, xmm1, 0
	paddd	xmm8, xmm1

	movd	xmm1, dword [rdx+6*4]
	pshufd  xmm1, xmm1, 0
	paddd	xmm9, xmm1

	movd	xmm1, dword [rdx+7*4]
	pshufd  xmm1, xmm1, 0
	paddd	xmm10, xmm1

debug_me:
	movdqa	[rdi+0*16], xmm7
	movdqa	[rdi+1*16], xmm5
	movdqa	[rdi+2*16], xmm4
	movdqa	[rdi+3*16], xmm3
	movdqa	[rdi+4*16], xmm0
	movdqa	[rdi+5*16], xmm8
	movdqa	[rdi+6*16], xmm9
	movdqa	[rdi+7*16], xmm10
LAB_RET:
	pop	rbx
	ret

Mind you, it does compile, so it's not THAT bad anymore. I figured out that Linux code is much simpler than Windows.
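When rearranging a SIMD core like this, it helps to have a plain scalar reference to compare against lane by lane (each 16-byte W entry in the asm appears to hold the same word for four different inputs). Below is a minimal C sketch of one SHA-256 compression, following the standard definition (FIPS 180-4) and laid out to mirror the asm: the expansion loop corresponds to LAB_CALC, the 64-round loop to LAB_LOOP (including the h = g, g = f, ... register rotation), and the final additions to the code after it. `sha256_compress`, `SHA256_K` and `SHA256_H0` are names chosen here, not cpuminer's.

```c
#include <stdint.h>
#include <string.h>

static const uint32_t SHA256_K[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
};

/* Standard SHA-256 initial state (what "init" holds for a first block). */
static const uint32_t SHA256_H0[8] = {
    0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
    0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
};

#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

/* One SHA-256 compression of a 16-word block into state[8] (one lane). */
void sha256_compress(uint32_t state[8], const uint32_t block[16])
{
    uint32_t w[64];
    memcpy(w, block, 16 * sizeof w[0]);

    for (int i = 16; i < 64; i++) {                      /* LAB_CALC */
        uint32_t s0 = ROTR(w[i-15], 7) ^ ROTR(w[i-15], 18) ^ (w[i-15] >> 3);
        uint32_t s1 = ROTR(w[i-2], 17) ^ ROTR(w[i-2], 19) ^ (w[i-2] >> 10);
        w[i] = w[i-16] + s0 + w[i-7] + s1;
    }

    uint32_t a = state[0], b = state[1], c = state[2], d = state[3];
    uint32_t e = state[4], f = state[5], g = state[6], h = state[7];

    for (int i = 0; i < 64; i++) {                       /* LAB_LOOP */
        uint32_t t1 = h + (ROTR(e, 6) ^ ROTR(e, 11) ^ ROTR(e, 25))
                        + ((e & f) ^ (~e & g)) + SHA256_K[i] + w[i];
        uint32_t t2 = (ROTR(a, 2) ^ ROTR(a, 13) ^ ROTR(a, 22))
                        + ((a & b) ^ (a & c) ^ (b & c));
        h = g; g = f; f = e; e = d + t1;                 /* register rotation */
        d = c; c = b; b = a; a = t1 + t2;
    }

    /* "Finished the 64 rounds, calculate hash and save" */
    state[0] += a; state[1] += b; state[2] += c; state[3] += d;
    state[4] += e; state[5] += f; state[6] += g; state[7] += h;
}
```

Fed the padded single block for "abc" and `SHA256_H0`, this produces the well-known digest ba7816bf...f20015ad; feeding it the same words as one lane of `data`/`init` is a quick way to check a modified core's output.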

jgarzik (OP) | Legendary | Activity: 1596 | Merit: 1145
June 20, 2011, 06:38:13 PM

A user randomly emailed the following sha256 core update: http://yyz.us/bitcoin/sha256_xmm_amd64_atom.asm

> Jeff - attached is a somewhat faster sse2_64 core, well, at least for the CPUs I've tested!
>
> An example on an Intel Atom D525 (dual core):
>
> [2011-06-14 14:18:42] 2 miner threads started, using SHA256 'sse2_64' algorithm.
> [2011-06-14 14:18:56] thread 0: 16777216 hashes, 1047.98 khash/sec
>
> [2011-06-14 14:18:19] 2 miner threads started, using SHA256 'sse2_64_atom' algorithm.
> [2011-06-14 14:18:31] thread 0: 16777216 hashes, 1234.20 khash/sec
>
> It should be faster on all Intel CPUs by quite some margin, up to 20% in my tests.

Anybody want to test this and prove his assertions?
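Checking figures like these is easier if the rate is pulled out of the "thread N: ... khash/sec" lines and compared directly. A small C sketch, assuming the log format shown in the email above; `parse_rate_line` and `speedup_percent` are illustrative helpers, not part of cpuminer.

```c
#include <stdio.h>

/* Parse a rate line such as
 *   [2011-06-14 14:18:56] thread 0: 16777216 hashes, 1047.98 khash/sec
 * Returns 0 on success, -1 if the line doesn't have that shape. */
int parse_rate_line(const char *line, int *thread, unsigned long *hashes,
                    double *khash)
{
    /* %*[^]] skips the timestamp inside the brackets */
    if (sscanf(line, "[%*[^]]] thread %d: %lu hashes, %lf khash/sec",
               thread, hashes, khash) != 3)
        return -1;
    return 0;
}

/* Percentage change from an old hash rate to a new one. */
double speedup_percent(double old_rate, double new_rate)
{
    return (new_rate / old_rate - 1.0) * 100.0;
}
```

For the Atom D525 lines this works out to roughly 17.8%, consistent with the "up to 20%" claim; the 3400 to 3700 khash/sec change reported later in the thread is about 8.8%.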

dserrano5 | Legendary | Activity: 1974 | Merit: 1038
June 20, 2011, 07:35:11 PM

Sorry, that's a bit beyond me:

$ gcc -c sha256_xmm_amd64_atom.asm
gcc: sha256_xmm_amd64_atom.asm: linker input file unused because linking not done

$ yasm !$
yasm sha256_xmm_amd64_atom.asm
sha256_xmm_amd64_atom.asm:26: warning: binary object format does not support extern variables
sha256_xmm_amd64_atom.asm:28: warning: binary object format does not support global variables
sha256_xmm_amd64_atom.asm:216: error: binary object format does not support external references
sha256_xmm_amd64_atom.asm:216: error: binary object format does not support external references
sha256_xmm_amd64_atom.asm:216: error: binary object format does not support external references
sha256_xmm_amd64_atom.asm:216: error: binary object format does not support external references
sha256_xmm_amd64_atom.asm:216: error: binary object format does not support external references
sha256_xmm_amd64_atom.asm:216: error: binary object format does not support external references
sha256_xmm_amd64_atom.asm:216: error: binary object format does not support external references
sha256_xmm_amd64_atom.asm:216: error: binary object format does not support external references

$ gcc sha256_xmm_amd64_atom.asm
/usr/bin/ld:sha256_xmm_amd64_atom.asm: file format not recognized; treating as linker script
/usr/bin/ld:sha256_xmm_amd64_atom.asm:1: syntax error
collect2: ld returned 1 exit status

$ mv sha256_xmm_amd64_atom.asm sha256_xmm_amd64_atom.s
$ gcc sha256_xmm_amd64_atom.s
sha256_xmm_amd64_atom.s: Assembler messages:
sha256_xmm_amd64_atom.s:1: Error: no such instruction: `sha-256 for X86-64 for Linux,based off of:'
sha256_xmm_amd64_atom.s:3: Error: junk at end of line, first unrecognized character is `('
sha256_xmm_amd64_atom.s:4: Error: no such instruction: `version 2011'
[... some screenfuls ...]

$ mv sha256_xmm_amd64_atom.s sha256_xmm_amd64_atom.S
$ gcc sha256_xmm_amd64_atom.S
[identical]
$ as sha256_xmm_amd64_atom.S
[identical]

$ as --version
GNU assembler (GNU Binutils for Ubuntu) 2.21.0.20110327
Copyright 2011 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `x86_64-linux-gnu'.

d3m0n1q_733rz
June 20, 2011, 07:42:51 PM (last edit: June 20, 2011, 07:54:45 PM by d3m0n1q_733rz)

Verified. Since it still tried to reference the original asm, I removed the _atom from the file name and from the references in the code before compiling, so it would replace my SSE2_64 more easily. But yes, I'm seeing a 300 khash/sec increase from around 3400.

;; SHA-256 for X86-64 for Linux, based off of:
  ; (c) Ufasoft 2011 http://ufasoft.com mailto:support@ufasoft.com ; Version 2011 ; This software is Public Domain
  ; Significant re-write/optimisation and reordering by, ; Neil Kettle <mu-b@digit-labs.org> ; ~18% performance improvement
  ; SHA-256 CPU SSE cruncher for Bitcoin Miner
  ALIGN 32 BITS 64
  %define hash rdi %define data rsi %define init rdx
  ; 0 = (1024 - 256) (mod (LAB_CALC_UNROLL*LAB_CALC_PARA*16)) %define LAB_CALC_PARA	2 %define LAB_CALC_UNROLL	8
  %define LAB_LOOP_UNROLL 8
  extern g_4sha256_k
  global CalcSha256_x64 ;	CalcSha256	hash(rdi), data(rsi), init(rdx) CalcSha256_x64:
  	push	rbx
  LAB_NEXT_NONCE:
  	mov	rcx, 64*4					; 256 - rcx is # of SHA-2 rounds 	mov	rax, 16*4					; 64 - rax is where we expand to
  LAB_SHA: 	push	rcx 	lea	rcx, qword [data+rcx*4]				; + 1024 	lea	r11, qword [data+rax*4]				; + 256
  LAB_CALC: %macro	lab_calc_blk 1 	movdqa	xmm0, [r11-(15-%1)*16]				; xmm0 = W[I-15] 	movdqa	xmm4, [r11-(15-(%1+1))*16]			; xmm4 = W[I-15+1] 	movdqa	xmm2, xmm0					; xmm2 = W[I-15] 	movdqa	xmm6, xmm4					; xmm6 = W[I-15+1] 	psrld	xmm0, 3						; xmm0 = W[I-15] >> 3 	psrld	xmm4, 3						; xmm4 = W[I-15+1] >> 3 	movdqa	xmm1, xmm0					; xmm1 = W[I-15] >> 3 	movdqa	xmm5, xmm4					; xmm5 = W[I-15+1] >> 3 	pslld	xmm2, 14					; xmm2 = W[I-15] << 14 	pslld	xmm6, 14					; xmm6 = W[I-15+1] << 14 	psrld	xmm1, 4						; xmm1 = W[I-15] >> 7 	psrld	xmm5, 4						; xmm5 = W[I-15+1] >> 7 	pxor	xmm0, xmm1					; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) 	pxor	xmm4, xmm5					; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) 	psrld	xmm1, 11					; xmm1 = W[I-15] >> 18 	psrld	xmm5, 11					; xmm5 = W[I-15+1] >> 18 	pxor	xmm0, xmm2					; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14) 	pxor	xmm4, xmm6					; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14) 	pslld	xmm2, 11					; xmm2 = W[I-15] << 25 	pslld	xmm6, 11					; xmm6 = W[I-15+1] << 25 	pxor	xmm0, xmm1					; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14) ^ (W[I-15] >> 18) 	pxor	xmm4, xmm5					; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14) ^ (W[I-15+1] >> 18) 	pxor	xmm0, xmm2					; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14) ^ (W[I-15] >> 18) ^ (W[I-15] << 25) 	pxor	xmm4, xmm6					; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14) ^ (W[I-15+1] >> 18) ^ (W[I-15+1] << 25)
  	movdqa	xmm3, [r11-(2-%1)*16]				; xmm3 = W[I-2] 	movdqa	xmm7, [r11-(2-(%1+1))*16]			; xmm7 = W[I-2+1]
  	paddd	xmm0, [r11-(16-%1)*16]				; xmm0 = s0(W[I-15]) + W[I-16] 	paddd	xmm4, [r11-(16-(%1+1))*16]			; xmm4 = s0(W[I-15+1]) + W[I-16+1]
  ;;;;;;;;;;;;;;;;;;
  	movdqa	xmm2, xmm3					; xmm2 = W[I-2] 	movdqa	xmm6, xmm7					; xmm6 = W[I-2+1] 	psrld	xmm3, 10					; xmm3 = W[I-2] >> 10 	psrld	xmm7, 10					; xmm7 = W[I-2+1] >> 10 	movdqa	xmm1, xmm3					; xmm1 = W[I-2] >> 10 	movdqa	xmm5, xmm7					; xmm5 = W[I-2+1] >> 10
  	paddd	xmm0, [r11-(7-%1)*16]				; xmm0 = s0(W[I-15]) + W[I-16] + W[I-7]
  	pslld	xmm2, 13					; xmm2 = W[I-2] << 13 	pslld	xmm6, 13					; xmm6 = W[I-2+1] << 13 	psrld	xmm1, 7						; xmm1 = W[I-2] >> 17 	psrld	xmm5, 7						; xmm5 = W[I-2+1] >> 17
  	paddd	xmm4, [r11-(7-(%1+1))*16]			; xmm4 = s0(W[I-15+1]) + W[I-16+1] + W[I-7+1]
	pxor	xmm3, xmm1					; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17)
	pxor	xmm7, xmm5					; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17)
	psrld	xmm1, 2						; xmm1 = W[I-2] >> 19
	psrld	xmm5, 2						; xmm5 = W[I-2+1] >> 19
	pxor	xmm3, xmm2					; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13)
	pxor	xmm7, xmm6					; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13)
	pslld	xmm2, 2						; xmm2 = W[I-2] << 15
	pslld	xmm6, 2						; xmm6 = W[I-2+1] << 15
	pxor	xmm3, xmm1					; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13) ^ (W[I-2] >> 19)
	pxor	xmm7, xmm5					; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13) ^ (W[I-2+1] >> 19)
	pxor	xmm3, xmm2					; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13) ^ (W[I-2] >> 19) ^ (W[I-2] << 15)
	pxor	xmm7, xmm6					; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13) ^ (W[I-2+1] >> 19) ^ (W[I-2+1] << 15)

	paddd	xmm0, xmm3					; xmm0 = s0(W[I-15]) + W[I-16] + s1(W[I-2]) + W[I-7]
	paddd	xmm4, xmm7					; xmm4 = s0(W[I-15+1]) + W[I-16+1] + s1(W[I-2+1]) + W[I-7+1]
	movdqa	[r11+(%1*16)], xmm0
	movdqa	[r11+((%1+1)*16)], xmm4
%endmacro

%assign i 0
%rep    LAB_CALC_UNROLL
        lab_calc_blk i
%assign i i+LAB_CALC_PARA
%endrep

	add	r11, LAB_CALC_UNROLL*LAB_CALC_PARA*16
	cmp	r11, rcx
	jb	LAB_CALC

	pop	rcx
	mov	rax, 0

; Load the init values of the message into the hash.

	movdqa	xmm7, [init]
	pshufd	xmm5, xmm7, 0x55		; xmm5 == b
	pshufd	xmm4, xmm7, 0xAA		; xmm4 == c
	pshufd	xmm3, xmm7, 0xFF		; xmm3 == d
	pshufd	xmm7, xmm7, 0			; xmm7 == a

	movdqa	xmm0, [init+4*4]
	pshufd	xmm8, xmm0, 0x55		; xmm8 == f
	pshufd	xmm9, xmm0, 0xAA		; xmm9 == g
	pshufd	xmm10, xmm0, 0xFF		; xmm10 == h
	pshufd	xmm0, xmm0, 0			; xmm0 == e

LAB_LOOP:

;; T t1 = h + (Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)) + ((e & f) ^ AndNot(e, g)) + Expand32<T>(g_sha256_k[j]) + w[j]

%macro	lab_loop_blk 0
	movdqa	xmm6, [data+rax*4]
	paddd	xmm6, g_4sha256_k[rax*4]
	add	rax, 4

	paddd	xmm6, xmm10	; +h

	movdqa	xmm1, xmm0
	movdqa	xmm2, xmm9
	pandn	xmm1, xmm2	; ~e & g

	movdqa	xmm10, xmm2	; h = g
	movdqa	xmm2, xmm8	; f
	movdqa	xmm9, xmm2	; g = f

	pand	xmm2, xmm0	; e & f
	pxor	xmm1, xmm2	; (e & f) ^ (~e & g)
	movdqa	xmm8, xmm0	; f = e

	paddd	xmm6, xmm1	; Ch + h + w[i] + k[i]

	movdqa	xmm1, xmm0
	psrld	xmm0, 6
	movdqa	xmm2, xmm0
	pslld	xmm1, 7
	psrld	xmm2, 5
	pxor	xmm0, xmm1
	pxor	xmm0, xmm2
	pslld	xmm1, 14
	psrld	xmm2, 14
	pxor	xmm0, xmm1
	pxor	xmm0, xmm2
	pslld	xmm1, 5
	pxor	xmm0, xmm1	; Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)
	paddd	xmm6, xmm0	; xmm6 = t1

	movdqa	xmm0, xmm3	; d
	paddd	xmm0, xmm6	; e = d+t1
	movdqa	xmm1, xmm5	; =b
	movdqa	xmm3, xmm4	; d = c
	movdqa	xmm2, xmm4	; c
	pand	xmm2, xmm5	; b & c
	pand	xmm4, xmm7	; a & c
	pand	xmm1, xmm7	; a & b
	pxor	xmm1, xmm4
	movdqa	xmm4, xmm5	; c = b
	movdqa	xmm5, xmm7	; b = a
	pxor	xmm1, xmm2	; (a & b) ^ (a & c) ^ (b & c)
	paddd	xmm6, xmm1	; t1 + ((a & b) ^ (a & c) ^ (b & c))

	movdqa	xmm2, xmm7
	psrld	xmm7, 2
	movdqa	xmm1, xmm7
	pslld	xmm2, 10
	psrld	xmm1, 11
	pxor	xmm7, xmm2
	pxor	xmm7, xmm1
	pslld	xmm2, 9
	psrld	xmm1, 9
	pxor	xmm7, xmm2
	pxor	xmm7, xmm1
	pslld	xmm2, 11
	pxor	xmm7, xmm2
	paddd	xmm7, xmm6	; a = t1 + (Rotr32(a, 2) ^ Rotr32(a, 13) ^ Rotr32(a, 22)) + ((a & b) ^ (a & c) ^ (b & c));
%endmacro
%assign i 0
%rep    LAB_LOOP_UNROLL
        lab_loop_blk
%assign i i+1
%endrep

	cmp	rax, rcx
	jb	LAB_LOOP

; Finished the 64 rounds, calculate hash and save

	movdqa	xmm1, [rdx]
	pshufd	xmm2, xmm1, 0x55
	pshufd	xmm6, xmm1, 0xAA
	pshufd	xmm11, xmm1, 0xFF
	pshufd	xmm1, xmm1, 0

	paddd	xmm5, xmm2
	paddd	xmm4, xmm6
	paddd	xmm3, xmm11
	paddd	xmm7, xmm1

	movdqa	xmm1, [rdx+4*4]
	pshufd	xmm2, xmm1, 0x55
	pshufd	xmm6, xmm1, 0xAA
	pshufd	xmm11, xmm1, 0xFF
	pshufd	xmm1, xmm1, 0

	paddd	xmm8, xmm2
	paddd	xmm9, xmm6
	paddd	xmm10, xmm11
	paddd	xmm0, xmm1

	movdqa	[hash+0*16], xmm7
	movdqa	[hash+1*16], xmm5
	movdqa	[hash+2*16], xmm4
	movdqa	[hash+3*16], xmm3
	movdqa	[hash+4*16], xmm0
	movdqa	[hash+5*16], xmm8
	movdqa	[hash+6*16], xmm9
	movdqa	[hash+7*16], xmm10

LAB_RET:
	pop	rbx
	ret
I notice that it doesn't rely as heavily on moving the quadwords around from one xmm register to another. But if there's some way of moving some of those into the processor cache, as I was trying to do, I think they could still be write-combined, which would speed up hashing just a smidge more. But, again, I'm still a noob at these more recent coding techniques.
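For anyone following the data flow of the listing above, here is a minimal scalar Python sketch of one SHA-256 compression that mirrors its three stages: the LAB_CALC message-schedule expansion, the 64-round LAB_LOOP, and the final addition of the saved state after the loop. The SSE2 code appears to hash four independent messages at once (pshufd broadcasts each state word into all four 32-bit lanes, and each round loads 16 bytes of w[] and of g_4sha256_k), so this sketch covers just one lane. The helper names (rotr, small_sigma1, compress, sha256_single_block, and so on) are hypothetical, the constants are derived from primes rather than typed in, and it only handles messages that fit in one 64-byte block.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF  # keep every value in 32 bits, like one lane of an xmm register


def first_primes(n):
    """Return the first n primes by simple trial division."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def icbrt(n):
    """Largest integer r with r**3 <= n (binary search on integers)."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


# Round constants and initial hash values: the first 32 bits of the fractional
# parts of the cube roots / square roots of the first primes.
K = [icbrt(p << 96) & MASK for p in first_primes(64)]
H0 = [math.isqrt(p << 64) & MASK for p in first_primes(8)]


def rotr(x, n):
    # SSE2 has no 32-bit rotate, so the listing builds each rotate from a
    # right shift and a left shift; their bits never overlap, so XOR == OR.
    return ((x >> n) ^ (x << (32 - n))) & MASK


def small_sigma0(x):
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def small_sigma1(x):
    # Same five terms as the s1 comments: x>>10, x>>17, x<<13, x>>19, x<<15
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def big_sigma0(a):
    return rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)


def big_sigma1(e):
    return rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)


def compress(state, block):
    """One SHA-256 compression of a 64-byte block (a single SIMD lane)."""
    # Message schedule (the LAB_CALC loop).
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        w.append((small_sigma1(w[i - 2]) + w[i - 7]
                  + small_sigma0(w[i - 15]) + w[i - 16]) & MASK)
    # 64 rounds (the LAB_LOOP body).
    a, b, c, d, e, f, g, h = state
    for i in range(64):
        ch = (e & f) ^ (~e & g)            # pandn computes ~e & g
        maj = (a & b) ^ (a & c) ^ (b & c)
        t1 = (h + big_sigma1(e) + ch + K[i] + w[i]) & MASK
        h, g, f, e = g, f, e, (d + t1) & MASK
        d, c, b, a = c, b, a, (t1 + big_sigma0(a) + maj) & MASK
    # Add the previous state back in (the paddd block after the loop).
    return [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]


def sha256_single_block(msg):
    """Hash a message of at most 55 bytes, so padding fits in one block."""
    if len(msg) > 55:
        raise ValueError("this sketch only handles single-block messages")
    block = (msg + b"\x80" + b"\x00" * (55 - len(msg))
             + struct.pack(">Q", 8 * len(msg)))
    return struct.pack(">8I", *compress(H0, block)).hex()


if __name__ == "__main__":
    print(sha256_single_block(b"abc"))
    print(hashlib.sha256(b"abc").hexdigest())
```

Running it prints the same digest twice, once from the sketch and once from hashlib, which is a quick way to sanity-check any reordering of the rotate or Ch/Maj steps before porting it back to the SSE2 code.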
							Funroll_Loops, the theoretically quicker breakfast cereal! Check out  http://www.facebook.com/JupiterICT for all of your computing needs.  If you need it, we can get it.  We have solutions for your computing conundrums.  BTC accepted!  12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq  
							d3m0n1q_733rz
							
June 20, 2011, 08:17:10 PM
							Tested on another machine and I'm seeing an increase from about 1500 to around 1750. 
							LehmanSister
							
								Member 
								  
								Activity: 68 
								Merit: 10
								 
								High Desert Dweller-Where Space and Time Meet $

June 21, 2011, 09:33:19 AM Last edit: June 21, 2011, 08:45:17 PM by lehmansister
I see the 20% boost on Atoms for sure.

git clone git://github.com/jgarzik/cpuminer.git
wget -O cpuminer/x86_64/sha256_xmm_amd64_atom.asm http://yyz.us/bitcoin/sha256_xmm_amd64_atom.asm
cd cpuminer
./autogen.sh
./configure
make all

Note: yasm 1.0 isn't in Debian stable yet. [Edit: Oops, I was doing quite a few things at once; I think I also stripped the "_atom" from the saved file name.]
							ISO: small island nations with large native populations excited to pay tribute to flying gods, will trade BTC. 
							theowalpott
							
								Member 
								  
								Activity: 80 
								Merit: 10

June 21, 2011, 06:03:16 PM Last edit: June 21, 2011, 10:03:25 PM by theowalpott
Getting a segfault with the sha256 cryptopp_asm32 option. I built from the git repo using:

./autogen.sh
CFLAGS="-O3 -Wall -msse2" ./configure
make

I run it against btcguild (haven't tried a different pool, though) and get this:

[2011-06-21 18:54:36] JSON option quiet invalid
[2011-06-21 18:54:37] Long-polling activated for http://uscentral.btcguild.com:8332/LP
Segmentation fault

I tried the 4way option and got ~700 kh/s, which seems quite slow for my CPU (quad Xeon @ 2.83 GHz); I get 1200 kh/s with the cryptopp_asm32 option in an older build.

I've just rebuilt using the default flags (-g -O2), and this seems to work fine. Possibly something is wrong in the usage of SSE2? I don't have a huge amount of experience with gcc, so I can't be much more specific.

EDIT: It builds fine with CFLAGS="-O2 -Wall -msse2"; switch to -O3 and it segfaults. I'm using gcc 4.4, by the way.
							1FwGATm6eU5dSiTp2rpazV5u3qwbx1fuDn 
							theowalpott
							
								Member 
								  
								Activity: 80 
								Merit: 10

June 21, 2011, 10:10:16 PM
Shameless bump here... I've still been unable to get cpuminer to work on btcguild; no matter what settings I choose, the silly thing still throws the errors. Is anyone using cpuminer on btcguild? I'm now discovering a different issue:

minerd.exe --algo cryptopp_asm32 --s 2 --url http://btcguild.com/ --userpass xxxx:xxx

This runs when I try it on deepbit, local miner, and a few others, but on btcguild I get the following error:

[2011-06-12 10:02:16] 1 miner threads started, using SHA256 'cryptopp_asm32' algorithm.
[2011-06-12 10:02:20] JSON decode failed(1): '[' or '{' expected near '<'
[2011-06-12 10:02:20] json_rpc_call failed, retry after 30 seconds

It's only happening with btcguild, though, not with any of the other mining pools I tested. Has anyone come across this before? (Win7, Intel Dual Core, Nvidia GTX470OC)

Try using the config files. Create a file called btcguild.json in the same directory as your minerd.exe, with the following inside:

{
	"_comment1" : "Any long-format command line argument ",
	"_comment2" : "may be used in this JSON configuration file",

	"url" : "http://uscentral.btcguild.com:8332",
	"user" : "USER_WORKER",
	"pass" : "PASSWORD",

	"algo" : "cryptopp_asm32",
	"threads" : "4",

	"quiet" : false
}

Then start the miner with:

minerd.exe --config btcguild.json

Obviously you can choose more threads, or remove that line entirely if you want the thread count handled automatically. Also, change the algorithm to whichever you prefer. Hope that helps.
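If you'd rather generate that file than type it by hand, here is a small Python sketch that assumes only the key names shown in the config above (make_minerd_config and make_config.py are hypothetical names, not part of cpuminer). Two hedged notes: values for options that take an argument are written as JSON strings, as in "threads" : "4"; and judging from the "JSON option quiet invalid" line in the segfault log posted earlier in the thread, this build appears to reject false for on/off flags, so the sketch simply leaves a disabled flag out. Separately, the '<' in "'[' or '{' expected near '<'" is most likely the first character of an HTML page, which suggests http://btcguild.com/ returned the website rather than a JSON-RPC reply; the config points at the pool's :8332 address instead.

```python
import json


def make_minerd_config(url, user, password, algo="cryptopp_asm32",
                       threads=None, quiet=False):
    """Return the text of a minerd JSON config like btcguild.json.

    Assumptions (hypothetical helper): valued options are emitted as JSON
    strings, and on/off flags are emitted only when switched on.
    """
    config = {"url": url, "user": user, "pass": password, "algo": algo}
    if threads is not None:
        config["threads"] = str(threads)  # written as a string, e.g. "4"
    if quiet:
        config["quiet"] = True            # left out entirely when off
    return json.dumps(config, indent=4)


if __name__ == "__main__":
    # Redirect to a file, e.g.: python make_config.py > btcguild.json
    print(make_minerd_config("http://uscentral.btcguild.com:8332",
                             "USER_WORKER", "PASSWORD", threads=4))
```

Leaving threads out (threads=None) omits the key, which matches the advice above about letting the thread count be handled automatically.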
					 
					
						
							
							 