Author Topic: OFFICIAL CGMINER mining software thread for linux/win/osx/mips/arm/r-pi 4.11.1  (Read 5805634 times)
This is a self-moderated topic. If you do not want to be moderated by the person who started this topic, create a new topic. (3 posts by 1+ user deleted.)
Viceroy
July 30, 2011, 01:44:50 PM
 #601

* ckolivas quietly replaces the tarballs with a fixed version
Ok, redownload :-X

LOL :)

The00Dustin
July 30, 2011, 02:35:07 PM
 #602

You can't figure out how to run cgminer in the directory you compiled it in? Sucks to be you, then...
LOL.... not that m8... in Linux I would need to figure out how to OC my cards! :P
(2x HD 5750, OC'd from 700MHz to 870MHz in windows with MSI Afterburner)
amdovdrvctrl works a treat for me and it's a gui app for those trying to escape windows
http://sourceforge.net/projects/amdovdrvctrl/
I'm using atitweak, which someone posted about here: http://forum.bitcoin.org/index.php?topic=25750.0  It installs with easy_install and is pretty simple to use.  It is important to know that if you have performance levels 0, 1, and 2, you need to use -P 2 so that only the top performance level is overclocked; otherwise a ridiculous setting gets sent to the lower performance levels, which use less voltage, and the card can lock up (IOW, only overclock the top performance level; if you set the voltage too, overclocking all levels would work, but it would just be a waste of power).  I saw a thread about which overclocking tools people were using too, but I never read it and can't find it now.
Viceroy
July 30, 2011, 03:15:04 PM
 #603

aticonfig --adapter=all --od-setclocks=900,300

aticonfig --adapter=all --odgt
aticonfig --adapter=all --odgc

To make the settings permanent:
aticonfig --adapter=all --odcc
zaytsev
July 30, 2011, 03:20:19 PM
 #604

+1, I've been using this very successfully so far, but be careful: it lets you supply values outside the supported range without asking for confirmation. You can brick your card this way if you are not attentive.
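
For anyone scripting this, a small guard in front of the command is one way to avoid feeding it a bad value. This is only a minimal Python sketch: the function name and the example ranges are made up for illustration, so take the real limits from what your own card reports.

```python
def build_setclocks_command(core_mhz, mem_mhz, core_range, mem_range):
    """Return the aticonfig argument list, or raise if a clock is out of range.

    core_range and mem_range are (low, high) tuples in MHz supplied by the
    caller; they are assumptions of this sketch, not values read from the card.
    """
    for name, value, (low, high) in (
        ("core", core_mhz, core_range),
        ("memory", mem_mhz, mem_range),
    ):
        if not low <= value <= high:
            raise ValueError(
                f"{name} clock {value} MHz is outside {low}-{high} MHz"
            )
    # Same command as posted above, with the checked values filled in.
    return ["aticonfig", "--adapter=all", f"--od-setclocks={core_mhz},{mem_mhz}"]


if __name__ == "__main__":
    # Hypothetical ranges, for illustration only.
    print(build_setclocks_command(900, 300, (600, 950), (300, 1300)))
```

The list it returns can be handed to subprocess.run once you are happy with the ranges.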
-ck (OP)
July 30, 2011, 03:23:42 PM
 #605

Also be aware that it may silently pretend to set the values but not actually do so. Although my 6970s report a possible RAM speed range of 320-1450, if I set them to anything below 825, it actually just resets them to normal values.
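
Because of that, it is worth reading the clocks back after setting them. Here is a rough Python sketch of such a check. It assumes the output of aticonfig --odgc contains a "Current Clocks : <core> <memory>" line; the sample text below is a made-up illustration of that layout, so adjust the pattern to whatever your driver version actually prints.

```python
import re

# Illustrative sample only; the exact layout is an assumption of this sketch.
SAMPLE_ODGC_OUTPUT = """\
Adapter 0 - AMD Radeon HD 6900 Series
                            Core (MHz)    Memory (MHz)
           Current Clocks :    880           1375
             Current Peak :    880           1375
"""


def parse_current_clocks(odgc_output):
    """Return (core_mhz, mem_mhz) from the 'Current Clocks' line."""
    match = re.search(r"Current Clocks\s*:\s*(\d+)\s+(\d+)", odgc_output)
    if match is None:
        raise ValueError("no 'Current Clocks' line found")
    return int(match.group(1)), int(match.group(2))


def clocks_took_effect(odgc_output, wanted_core, wanted_mem):
    """True only if the reported clocks match what was requested."""
    return parse_current_clocks(odgc_output) == (wanted_core, wanted_mem)


if __name__ == "__main__":
    # The request for 300 MHz memory was silently dropped in this sample.
    print(clocks_took_effect(SAMPLE_ODGC_OUTPUT, 880, 300))
```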

Developer/maintainer for cgminer, ckpool/ckproxy, and the -ck kernel
2% Fee Solo mining at solo.ckpool.org
-ck
PLaci1982
July 30, 2011, 03:30:37 PM
 #606

I did give atitweak a try, but it saw only 1 of my 2 cards...
I don't want to play around with xorg.conf, and I don't have dummy plugs...

You can't figure out how to run cgminer in the directory you compiled it in? Sucks to be you, then...
LOL.... not that m8... in Linux I would need to figure out how to OC my cards! :P
(2x HD 5750, OC'd from 700MHz to 870MHz in windows with MSI Afterburner)
That explains that, then... :D
Don't you have some other rig with a maybe slower card you can experiment on? Anyways, I assume there are enough people in these forums that had your problem that there will be a solution posted somewhere if you take the time to research it - and I doubt you need to have Linux booted for that.

I have another PC with an HD 2600 XT, and one with an HD 4550. The first isn't compatible with the SDK; the second isn't worth mining with, and that PC also runs 24/7...

Hardware Expert / WinXP, Win7 Expert

1J5oPkyGVdb4mv44KGZQYsHS2ch6e1t4rc
GenTarkin
July 30, 2011, 05:19:11 PM
 #607

Hello, I have to say I love CGMINER; it gets 4 Mh/s more on my 6950 than any config of GUIminer does. I'm using Windows.

I am trying to set it up on my other PC, which has no OpenCL-compatible device. GUIminer has no issue mining on the CPU, but when I start CGMINER it says no devices found, and no matter what CPU flags I throw at it, it refuses to run... any ideas?
It says "Error getting devices IDs (num)"


GenTarkin's MOD Kncminer Titan custom firmware! v1.0.4! -- !!NO LONGER AVAILABLE!!
Donations: bitcoin- 1Px71mWNQNKW19xuARqrmnbcem1dXqJ3At || litecoin- LYXrLis3ik6TRn8tdvzAyJ264DRvwYVeEw
miscreanity
July 30, 2011, 05:49:14 PM
 #608

Also be aware that it may silently pretend to set the values but not actually do so. Although my 6970s report ram speeds possible of 320-1450, if I set them to anything below 825, it actually just resets them to normal values.

I had the same issue with the 69xx series. There can't be more than a 125 MHz difference between core and memory clock speeds. It wasn't exactly elegant, but the solution that seems to work consistently can be found here:

http://forums.extremeoverclocking.com/showthread.php?t=355592

Basically, use a FreeDOS USB stick loaded with atiflash.exe to boot and extract the GPU BIOS. Then either use a Windows XP/Vista/7 system or a virtual session to run TechPowerUp's Radeon BIOS Editor. With it, you can set the memory clock speeds and save the updated BIOS. Reboot with the USB stick, flash the GPU BIOS and reboot a final time.

After that relatively painless process, I was able to use any method to underclock memory. It was well worth it, as my cards are running at the same Mh rate but ~7C cooler with memory underclocked and all other settings the same as before. I was even able to reliably raise my core speeds a bit past where they used to fail.
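
To make the 125 MHz rule concrete, here is a tiny Python sketch. It assumes the rule means the memory clock may sit at most 125 MHz below the core clock on an unmodified BIOS; that reading is my interpretation of the behaviour described above, not a documented limit, and the function names are invented for the example.

```python
MAX_GAP_MHZ = 125  # figure quoted above; treated here as an assumption


def lowest_accepted_memory_clock(core_mhz, max_gap_mhz=MAX_GAP_MHZ):
    """Lowest memory clock the stock BIOS would accept under the assumed rule."""
    return core_mhz - max_gap_mhz


def memory_clock_accepted(core_mhz, mem_mhz, max_gap_mhz=MAX_GAP_MHZ):
    """True if mem_mhz is no more than max_gap_mhz below core_mhz."""
    return mem_mhz >= lowest_accepted_memory_clock(core_mhz, max_gap_mhz)


if __name__ == "__main__":
    # With a 950 MHz core, anything under 825 MHz memory would be rejected.
    print(lowest_accepted_memory_clock(950))
    print(memory_clock_accepted(950, 300))
```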


I did gave a try to atitweak, but it saw only 1 of 2 cards...
I don't want to play around with xorg.conf, and I don't have dummy plugs...

I have one another PC with a HD 2600 XT, and one with a HD 4550. The 1st ain't compatible with the SDK, the second ain't worth to mine with and that PC also does run 7/24...

Dummy plugs aren't necessary with Linux. If atitweak doesn't see both cards (assuming they both support CL and work with another OS), try aticonfig. Re-seat the cards, try swapping slots, install one card on the board at a time, etc... it could be a hardware or power problem. I haven't used Windows outside of virtual sessions for years, so I don't know if it's still more tolerant (read: lax) about hardware issues than Linux.
PLaci1982
July 30, 2011, 05:59:58 PM
 #609

Quote from: miscreanity on July 30, 2011, 05:49:14 PM
Dummy plugs aren't necessary with Linux. If atitweak doesn't see both cards (assuming they both support CL and work with another OS), try aticonfig. Re-seat the cards, try swapping slots, install one card on the board at a time, etc... it could be a hardware or power problem. I haven't used Windows outside of virtual sessions for years, so I don't know if it's still more tolerant (read: lax) about hardware issues than Linux.
The same setup works 100% with Win7 without any problem...

d3m0n1q_733rz
July 30, 2011, 10:56:51 PM
 #610

Hey, I've been working on the hashing asm, as I said before: removing redundant operations and register moves, reordering sources and destinations to take advantage of processor hardware optimizations, and doing some of the easy math myself so the processor doesn't have to.  Here's what I've done so far.  It's not much, but it works.  Don't go changing the github source just yet though.  For now, copy-paste this to replace your existing sha256_sse4_amd64.asm file.  For those of you without SSE4.1 (such as AMD users), copy-paste this into your sse2_amd64 file instead and search-replace all uses of movntdqa with movdqa so the quick memory moves aren't used.

So here it is:
Code:
;; SHA-256 for X86-64 for Linux, based off of:

; (c) Ufasoft 2011 http://ufasoft.com mailto:support@ufasoft.com
; Version 2011
; This software is Public Domain

; Significant re-write/optimisation and reordering by,
; Neil Kettle <mu-b@digit-labs.org>
; ~18% performance improvement

; SHA-256 CPU SSE cruncher for Bitcoin Miner

ALIGN 32
BITS 64

%define hash rdi
%define data rsi
%define init rdx

; 0 = (1024 - 256) (mod (LAB_CALC_UNROLL*LAB_CALC_PARA*16))
%define LAB_CALC_PARA 2
%define LAB_CALC_UNROLL 8

%define LAB_LOOP_UNROLL 8

extern g_4sha256_k

global CalcSha256_x64_sse4
; CalcSha256 hash(rdi), data(rsi), init(rdx)
CalcSha256_x64_sse4:

push rbx

LAB_NEXT_NONCE:

mov rcx, 256 ; rcx = 256: 64 SHA-2 rounds, with rax advancing by 4 per round
; mov rax, 64 ; 64 - rax is where we expand to

LAB_SHA:
push rcx
lea rcx, qword [data+(1024)] ; + 1024
lea r11, qword [data+(256)] ; + 256

LAB_CALC:
%macro lab_calc_blk 1

movntdqa xmm0, [r11-(15-%1)*16] ; xmm0 = W[I-15]
movntdqa xmm4, [r11-(15-(%1+1))*16] ; xmm4 = W[I-15+1]
movdqa xmm2, xmm0 ; xmm2 = W[I-15]
movdqa xmm6, xmm4 ; xmm6 = W[I-15+1]

psrld xmm0, 3 ; xmm0 = W[I-15] >> 3
movdqa xmm1, xmm0 ; xmm1 = W[I-15] >> 3
pslld xmm2, 14 ; xmm2 = W[I-15] << 14
psrld xmm4, 3 ; xmm4 = W[I-15+1] >> 3
movdqa xmm5, xmm4 ; xmm5 = W[I-15+1] >> 3
psrld xmm5, 4 ; xmm5 = W[I-15+1] >> 7
pxor xmm4, xmm5 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7)
pslld xmm6, 14 ; xmm6 = W[I-15+1] << 14
psrld xmm1, 4 ; xmm1 = W[I-15] >> 7
pxor xmm0, xmm1 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7)
pxor xmm0, xmm2 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14)
psrld xmm1, 11 ; xmm1 = W[I-15] >> 18
psrld xmm5, 11 ; xmm5 = W[I-15+1] >> 18
pxor xmm4, xmm6 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14)
pxor xmm4, xmm5 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14) ^ (W[I-15+1] >> 18)
pslld xmm2, 11 ; xmm2 = W[I-15] << 25
pslld xmm6, 11 ; xmm6 = W[I-15+1] << 25
pxor xmm4, xmm6 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14) ^ (W[I-15+1] >> 18) ^ (W[I-15+1] << 25)
pxor xmm0, xmm1 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14) ^ (W[I-15] >> 18)
pxor xmm0, xmm2 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14) ^ (W[I-15] >> 18) ^ (W[I-15] << 25)
paddd xmm0, [r11-(16-%1)*16] ; xmm0 = s0(W[I-15]) + W[I-16]
paddd xmm4, [r11-(16-(%1+1))*16] ; xmm4 = s0(W[I-15+1]) + W[I-16+1]
movntdqa xmm3, [r11-(2-%1)*16] ; xmm3 = W[I-2]
movntdqa xmm7, [r11-(2-(%1+1))*16] ; xmm7 = W[I-2+1]

;;;;;;;;;;;;;;;;;;

movdqa xmm2, xmm3 ; xmm2 = W[I-2]
psrld xmm3, 10 ; xmm3 = W[I-2] >> 10
movdqa xmm1, xmm3 ; xmm1 = W[I-2] >> 10
movdqa xmm6, xmm7 ; xmm6 = W[I-2+1]
psrld xmm7, 10 ; xmm7 = W[I-2+1] >> 10
movdqa xmm5, xmm7 ; xmm5 = W[I-2+1] >> 10

paddd xmm0, [r11-(7-%1)*16] ; xmm0 = s0(W[I-15]) + W[I-16] + W[I-7]
paddd xmm4, [r11-(7-(%1+1))*16] ; xmm4 = s0(W[I-15+1]) + W[I-16+1] + W[I-7+1]

pslld xmm2, 13 ; xmm2 = W[I-2] << 13
pslld xmm6, 13 ; xmm6 = W[I-2+1] << 13
psrld xmm1, 7 ; xmm1 = W[I-2] >> 17
psrld xmm5, 7 ; xmm5 = W[I-2+1] >> 17



pxor xmm3, xmm1 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17)
psrld xmm1, 2 ; xmm1 = W[I-2] >> 19
pxor xmm3, xmm2 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13)
pslld xmm2, 2 ; xmm2 = W[I-2] << 15
pxor xmm7, xmm5 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17)
psrld xmm5, 2 ; xmm5 = W[I-2+1] >> 19
pxor xmm7, xmm6 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13)
pslld xmm6, 2 ; xmm6 = W[I-2+1] << 15



pxor xmm3, xmm1 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13) ^ (W[I-2] >> 19)
pxor xmm3, xmm2 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13) ^ (W[I-2] >> 19) ^ (W[I-2] << 15)
paddd xmm0, xmm3 ; xmm0 = s0(W[I-15]) + W[I-16] + s1(W[I-2]) + W[I-7]
pxor xmm7, xmm5 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13) ^ (W[I-2+1] >> 19)
pxor xmm7, xmm6 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13) ^ (W[I-2+1] >> 19) ^ (W[I-2+1] << 15)
paddd xmm4, xmm7 ; xmm4 = s0(W[I-15+1]) + W[I-16+1] + s1(W[I-2+1]) + W[I-7+1]

movdqa [r11+(%1*16)], xmm0
movdqa [r11+((%1+1)*16)], xmm4
%endmacro

%assign i 0
%rep    LAB_CALC_UNROLL
        lab_calc_blk i
%assign i i+LAB_CALC_PARA
%endrep

add r11, LAB_CALC_UNROLL*LAB_CALC_PARA*16
cmp r11, rcx
jb LAB_CALC

pop rcx
mov rax, 0

; Load the init values of the message into the hash.

movntdqa xmm7, [init]
movntdqa xmm0, [init+16]
pshufd xmm5, xmm7, 0x55 ; xmm5 == b
pshufd xmm8, xmm0, 0x55 ; xmm8 == f
pshufd xmm4, xmm7, 0xAA ; xmm4 == c
pshufd xmm9, xmm0, 0xAA ; xmm9 == g
pshufd xmm3, xmm7, 0xFF ; xmm3 == d
pshufd xmm10, xmm0, 0xFF ; xmm10 == h
pshufd xmm7, xmm7, 0 ; xmm7 == a
pshufd xmm0, xmm0, 0 ; xmm0 == e

LAB_LOOP:

;; T t1 = h + (Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)) + ((e & f) ^ AndNot(e, g)) + Expand32<T>(g_sha256_k[j]) + w[j]

%macro lab_loop_blk 0
movntdqa xmm6, [data+rax*4]
paddd xmm6, g_4sha256_k[rax*4]
add rax, 4

paddd xmm6, xmm10 ; +h

movdqa xmm1, xmm0
; movdqa xmm2, xmm9 ; It's redundant unless xmm9 becomes a destination
pandn xmm1, xmm9 ; ~e & g Changed from xmm2 to xmm9

movdqa xmm10, xmm9 ; h = g  Changed from xmm2 to xmm9
movdqa xmm9, xmm8 ; g = f
movdqa xmm2, xmm8 ; xmm2 = f (xmm9 became a destination, but not until xmm2 was already used and replaced)

pand xmm2, xmm0 ; e & f
pxor xmm1, xmm2 ; (e & f) ^ (~e & g)
paddd xmm6, xmm1 ; Ch + h + w[i] + k[i]

movdqa xmm8, xmm0 ; f = e Combining these three moves for processor hardware optimization
movdqa xmm1, xmm0
movdqa xmm2, xmm0
psrld xmm0, 6 ; The xmm2 from xmm0 move used to be after this taking advantage of the r-rotate 6
pslld xmm1, 7
psrld xmm2, 11 ; Changed from 5 to 11 after shoving the movdqa commands together
pxor xmm0, xmm1
pxor xmm0, xmm2
pslld xmm1, 14
psrld xmm2, 14
pxor xmm0, xmm1
pxor xmm0, xmm2
pslld xmm1, 5
pxor xmm0, xmm1 ; Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)
paddd xmm6, xmm0 ; xmm6 = t1

movdqa xmm0, xmm3 ; d
paddd xmm0, xmm6 ; e = d+t1

movdqa xmm1, xmm5 ; =b
movdqa xmm3, xmm4 ; d = c
movdqa xmm2, xmm4 ; c
pand xmm2, xmm5 ; b & c
pand xmm4, xmm7 ; a & c
pand xmm1, xmm7 ; a & b
pxor xmm1, xmm4
movdqa xmm4, xmm5 ; c = b
movdqa xmm5, xmm7 ; b = a
pxor xmm1, xmm2 ; (a & b) ^ (a & c) ^ (b & c)
paddd xmm6, xmm1 ; t1 + ((a & b) ^ (a & c) ^ (b & c))

movdqa xmm2, xmm7
psrld xmm7, 2
movdqa xmm1, xmm7
pslld xmm2, 10
psrld xmm1, 11
pxor xmm7, xmm2
pxor xmm7, xmm1
pslld xmm2, 9
psrld xmm1, 9
pxor xmm7, xmm2
pxor xmm7, xmm1
pslld xmm2, 11
pxor xmm7, xmm2
paddd xmm7, xmm6 ; a = t1 + (Rotr32(a, 2) ^ Rotr32(a, 13) ^ Rotr32(a, 22)) + ((a & b) ^ (a & c) ^ (b & c));
%endmacro

%assign i 0
%rep    LAB_LOOP_UNROLL
        lab_loop_blk
%assign i i+1
%endrep

cmp rax, rcx
jb LAB_LOOP

; Finished the 64 rounds, calculate hash and save

movntdqa xmm1, [rdx]
pshufd xmm2, xmm1, 0x55
paddd xmm5, xmm2
pshufd xmm6, xmm1, 0xAA
paddd xmm4, xmm6
pshufd xmm11, xmm1, 0xFF
paddd xmm3, xmm11
pshufd xmm1, xmm1, 0
paddd xmm7, xmm1

movntdqa xmm1, [rdx+16]
pshufd xmm2, xmm1, 0x55
paddd xmm8, xmm2
pshufd xmm6, xmm1, 0xAA
paddd xmm9, xmm6
pshufd xmm11, xmm1, 0xFF
paddd xmm10, xmm11
pshufd xmm1, xmm1, 0
paddd xmm0, xmm1

movdqa [hash], xmm7
movdqa [hash+16], xmm5
movdqa [hash+32], xmm4
movdqa [hash+48], xmm3
movdqa [hash+64], xmm0
movdqa [hash+80], xmm8
movdqa [hash+96], xmm9
movdqa [hash+112], xmm10

LAB_RET:
pop rbx
ret

I'll be attacking the LAB_LOOP next.
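
For anyone trying to follow the LAB_CALC comments, here is a plain Python reference for what that block computes. SSE2 has no packed rotate, so the asm builds each rotation out of a right shift and a left shift XORed together; the sketch checks that the five-term shift form in the comments matches the usual rotate-based s0 and s1 of the SHA-256 message schedule. The function names are mine and this is a scalar illustration only, not the 4-lane code above.

```python
MASK = 0xFFFFFFFF  # keep everything to 32 bits, like one dword lane


def rotr(x, n):
    """32-bit rotate right."""
    return ((x >> n) | (x << (32 - n))) & MASK


def s0_reference(x):
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def s0_shifts(x):
    # The five terms from the asm comments: >>3, >>7, <<14, >>18, <<25.
    return ((x >> 3) ^ (x >> 7) ^ (x << 14) ^ (x >> 18) ^ (x << 25)) & MASK


def s1_reference(x):
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def s1_shifts(x):
    # The five terms from the asm comments: >>10, >>17, <<13, >>19, <<15.
    return ((x >> 10) ^ (x >> 17) ^ (x << 13) ^ (x >> 19) ^ (x << 15)) & MASK


def expand(words16):
    """Expand 16 message words to 64: W[i] = s0(W[i-15]) + W[i-16] + s1(W[i-2]) + W[i-7]."""
    w = list(words16)
    for i in range(16, 64):
        w.append(
            (s0_shifts(w[i - 15]) + w[i - 16] + s1_shifts(w[i - 2]) + w[i - 7]) & MASK
        )
    return w


if __name__ == "__main__":
    # Padded single block for the message "abc".
    block = [0x61626380] + [0] * 14 + [0x18]
    print([hex(v) for v in expand(block)[16:18]])
```

The shift form works because the two halves of each rotation never overlap, so XOR and OR give the same result.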

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs.  If you need it, we can get it.  We have solutions for your computing conundrums.  BTC accepted!  12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq
xcooling
July 30, 2011, 11:05:55 PM
 #611

Nice contribution.

I'm still trying to get 64-bit builds working on Win7 64-bit.

plantucha
July 30, 2011, 11:52:55 PM
 #612

Quote from: d3m0n1q_733rz on July 30, 2011, 10:56:51 PM
For those of you without SSE4.1 (such as AMD users), copy-paste this into your sse2_amd64 file instead and search-replace all uses of movntdqa with movdqa so the quick memory moves aren't used.

Where is the sse2_amd64 file for AMD users located?

In /x86_64 there are only:

sha256_sse4_amd64.asm
sha256_xmm_amd64.asm

I don't see sha256_sse2_amd64.asm anywhere.

RudeDude
July 30, 2011, 11:53:09 PM
Last edit: July 31, 2011, 12:04:53 AM by RudeDude
 #613

Quote from: d3m0n1q_733rz on July 30, 2011, 10:56:51 PM
For those of you without SSE4.1 (such as AMD users), copy-paste this into your sse2_amd64 file instead and search-replace all uses of movntdqa with movdqa so the quick memory moves aren't used.

I pasted your ASM into sha256_xmm_amd64.asm and changed "movntdqa" to "movdqa" like you said for sse2. But I get a linker error.
Code:
...
cgminer-sha256_sse2_amd64.o: In function `scanhash_sse2_64':
sha256_sse2_amd64.c:(.text+0x4fb): undefined reference to `CalcSha256_x64'
sha256_sse2_amd64.c:(.text+0x50b): undefined reference to `CalcSha256_x64'
collect2: ld returned 1 exit status
...

I had to change "CalcSha256_x64_sse4" to "CalcSha256_x64" in two spots. Then the compile went just fine. I'm running it now to see if it's any faster and if any work actually gets accepted, but hopefully it's bug free.
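
In case anyone else wants to repeat this without hand-editing, the two changes can be scripted. This is just a Python sketch of the search-replace described here (movntdqa to movdqa, plus the symbol rename that fixed the linker error); the function name is made up, and it works on the file contents as a string so you can decide yourself where to read and write.

```python
import re


def port_sse4_to_xmm(asm_text):
    """Apply the two edits from this post to the posted asm source.

    1. movntdqa -> movdqa, so no SSE4.1 instruction remains.
    2. CalcSha256_x64_sse4 -> CalcSha256_x64, the symbol the C file links against.
    """
    text = re.sub(r"\bmovntdqa\b", "movdqa", asm_text)
    return re.sub(r"\bCalcSha256_x64_sse4\b", "CalcSha256_x64", text)


if __name__ == "__main__":
    sample = (
        "global CalcSha256_x64_sse4\n"
        "CalcSha256_x64_sse4:\n"
        "movntdqa xmm7, [init]\n"
        "movdqa xmm1, xmm0\n"
    )
    print(port_sse4_to_xmm(sample))
```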

btw, doesn't the assembler do basic inline math before assembling?

P.S. Hashrate looks really close to the same but I did get a work unit accepted just now.

EDIT: So the increase in speed, if any, is around 1%, maybe slightly more. I only have two cores at 3.5 Mh/s each, so it's hard to see the difference on the scale of Mhash/s.
d3m0n1q_733rz
July 31, 2011, 12:26:38 AM
 #614

Quote from: RudeDude on July 30, 2011, 11:53:09 PM
I had to change "CalcSha256_x64_sse4" to "CalcSha256_x64" in two spots. Then the compile went just fine. I'm running it now to see if it's any faster and if any work actually gets accepted, but hopefully it's bug free.

btw, doesn't the assembler do basic inline math before assembling?

P.S. Hashrate looks really close to the same but I did get a work unit accepted just now.

EDIT: So the increase in speed, if any, is around 1%, maybe slightly more. I only have two cores at 3.5 Mh/s each, so it's hard to see the difference on the scale of Mhash/s.

Admittedly, there won't be much of a speed improvement just yet, as I haven't really gone after the main loop.  The vast majority of the changes I've made apply only to the code that runs just before the work is fed into the loop.  Also, leaving things to the assembler on the assumption that it will handle them tends to leave room for problems.  Sometimes a change that you think will take place doesn't, and you end up with extra CPU instructions to execute.  It's often best to head them off beforehand.
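
On the inline math question: as far as I know, NASM and yasm evaluate constant expressions such as (15-%1)*16 while assembling, so those end up as plain displacements in the instruction rather than work for the CPU. The Python sketch below just replays the %rep loop around lab_calc_blk to show the numbers that come out for the first load in each expansion; the helper name is mine and the constants are the %define values from the posted source.

```python
LAB_CALC_PARA = 2    # from the posted %define
LAB_CALC_UNROLL = 8  # from the posted %define


def lab_calc_displacements():
    """Displacement of the first load, [r11-(15-i)*16], in each macro expansion."""
    displacements = []
    i = 0
    for _ in range(LAB_CALC_UNROLL):
        displacements.append(-(15 - i) * 16)
        i += LAB_CALC_PARA  # mirrors: %assign i i+LAB_CALC_PARA
    return displacements


def lab_calc_stride():
    """Bytes added to r11 after one unrolled pass."""
    return LAB_CALC_UNROLL * LAB_CALC_PARA * 16


if __name__ == "__main__":
    print(lab_calc_displacements())
    print(lab_calc_stride())
```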

d3m0n1q_733rz
July 31, 2011, 12:29:57 AM
 #615

Quote from: plantucha on July 30, 2011, 11:52:55 PM
Where is the sse2_amd64 file for AMD users located?

In /x86_64 there are only:

sha256_sse4_amd64.asm
sha256_xmm_amd64.asm

I don't see sha256_sse2_amd64.asm anywhere.

Oops, sorry.  The xmm version is what I meant.  I keep calling them sse2 and sse4 in my head because it helps me keep the two instruction sets apart.  I'm looking for places to use SSE3 instructions to run math calculations on dwords simultaneously, but I would have to restructure the entire program to take advantage of them, and even then I'm not sure whether it would work better or worse.

plantucha
July 31, 2011, 12:36:32 AM
Last edit: July 31, 2011, 12:53:12 AM by plantucha
 #616

Quote from: d3m0n1q_733rz on July 31, 2011, 12:29:57 AM
Oops, sorry.  The xmm version is what I meant.  I keep calling them sse2 and sse4 in my head because it helps me keep the two instruction sets apart.  I'm looking for places to use SSE3 instructions to run math calculations on dwords simultaneously, but I would have to restructure the entire program to take advantage of them, and even then I'm not sure whether it would work better or worse.

AMD Phenom X6

sse2          17.4 MHash/s
fixed sse2    18.0 MHash/s
4way          20.4 MHash/s
sse4          illegal instruction

Edit:
4way works in 1.4.down
In 1.5.up it works too (same speed), but everything is rejected.
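
To put those numbers in perspective, here is a quick Python sketch that turns them into percentages relative to the stock sse2 result. Only the three hash rates come from my run; the helper name is made up.

```python
def percent_gain(baseline_mhash, new_mhash):
    """Relative speed-up of new_mhash over baseline_mhash, in percent."""
    return (new_mhash - baseline_mhash) / baseline_mhash * 100.0


if __name__ == "__main__":
    sse2, fixed_sse2, fourway = 17.4, 18.0, 20.4  # MHash/s from the table above
    print(f"fixed sse2 vs sse2: {percent_gain(sse2, fixed_sse2):.1f}%")
    print(f"4way vs sse2:       {percent_gain(sse2, fourway):.1f}%")
```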
c_k
July 31, 2011, 01:44:28 AM
 #617

Great work ckolivas!

It looks like this will become the miner of choice with all the slick features you are adding.

Could you look at adding an option for monitoring the GPU temperature and backing off when it hits a maximum value and not resuming until it hits another minimum value?

If you included this you would be negating the need to ever use anything else imo :)
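
What I have in mind is a simple two-threshold (hysteresis) rule. A rough Python sketch of the logic, not anything cgminer does today; the class name and the 85/70 degree thresholds are invented for the example:

```python
class TempGovernor:
    """Stop mining at max_c and only resume once the GPU has cooled to resume_c."""

    def __init__(self, max_c, resume_c):
        if resume_c >= max_c:
            raise ValueError("resume_c must be below max_c")
        self.max_c = max_c
        self.resume_c = resume_c
        self.mining = True

    def update(self, temp_c):
        """Feed in the latest temperature reading; returns True if mining should run."""
        if self.mining and temp_c >= self.max_c:
            self.mining = False   # too hot: back off
        elif not self.mining and temp_c <= self.resume_c:
            self.mining = True    # cooled down enough: resume
        return self.mining


if __name__ == "__main__":
    governor = TempGovernor(max_c=85, resume_c=70)  # hypothetical thresholds
    for reading in (60, 84, 85, 80, 71, 70, 75):
        print(reading, governor.update(reading))
```

The gap between the two thresholds is what stops it from flapping on and off around a single temperature.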

d3m0n1q_733rz
July 31, 2011, 03:59:50 AM
 #618

Quote from: c_k on July 31, 2011, 01:44:28 AM
Could you look at adding an option for monitoring the GPU temperature and backing off when it hits a maximum value and not resuming until it hits another minimum value?

If you included this you would be negating the need to ever use anything else imo :)
You know, technically, that feature should be handled by the GPU itself.  But I know that ufasoft has implemented it for some reason.  It's more of a safeguard against failure of the hardware's throttle.  Alternatively, you could use free software to raise the fan speed of your card at higher temps.  That could help keep the card from reaching that temperature in the first place.
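
By adjusting the fan speed I mean something like a simple fan curve. Here's a minimal Python sketch of one, using straight-line interpolation between two points; the function name and the 50C/40% and 85C/100% points are made-up examples, and actually applying the result is left to whatever fan tool you use.

```python
def fan_percent(temp_c, low=(50, 40), high=(85, 100)):
    """Map a temperature to a fan duty cycle in percent.

    low and high are (temperature_c, fan_percent) points; below the first the
    fan stays at its minimum, above the second it stays at its maximum, and in
    between the value is interpolated linearly.
    """
    (t0, f0), (t1, f1) = low, high
    if temp_c <= t0:
        return f0
    if temp_c >= t1:
        return f1
    return round(f0 + (f1 - f0) * (temp_c - t0) / (t1 - t0))


if __name__ == "__main__":
    for temp in (45, 60, 75, 90):
        print(temp, fan_percent(temp))
```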

d3m0n1q_733rz
July 31, 2011, 04:04:03 AM
 #619

Quote from: plantucha on July 31, 2011, 12:36:32 AM
AMD Phenom X6

sse2          17.4 MHash/s
fixed sse2    18.0 MHash/s
4way          20.4 MHash/s
sse4          illegal instruction

Edit:
4way works in 1.4.down
In 1.5.up it works too (same speed), but everything is rejected.

I'll take a look at 4-way for you later tonight.  The SSE4 illegal instruction would be movntdqa, the non-temporal load of a double quadword from memory into an xmm register, which isn't supported on AMD processors as of yet.  I'm looking for a way to break the data down to double words so that I can take advantage of AMD's movntdd command to achieve the same thing, but it's more difficult than it sounds.  I'm actually surprised that 4way is working faster for you than SSE2.  Is this all after the threads normalized?
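
If you want to check up front whether a box will hit that illegal instruction, on Linux the CPU flags in /proc/cpuinfo tell you: movntdqa needs the sse4_1 flag. Below is a small Python sketch of that check. The parsing works on the file contents passed in as a string, and the "sse4" / "sse2" / "generic" labels it returns are just names for this example, not cgminer option values.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def pick_hash_path(flags):
    """Choose which asm variant is safe to run (labels are illustrative)."""
    if "sse4_1" in flags:
        return "sse4"     # movntdqa is available
    if "sse2" in flags:
        return "sse2"     # use the movdqa variant instead
    return "generic"


if __name__ == "__main__":
    # Shortened, made-up flags line for a CPU without SSE4.1.
    sample = "model name\t: example CPU\nflags\t\t: fpu sse sse2 sse3 sse4a\n"
    print(pick_hash_path(cpu_flags(sample)))
```

On a real machine you would pass in open("/proc/cpuinfo").read() instead of the sample string.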

GenTarkin
July 31, 2011, 06:39:29 AM
 #620

Any chance of an updated win32 build sometime? It's still at v1.5.1...
