Bitcoin Forum
Author Topic: further improved phatk_dia kernel for Phoenix + SDK 2.6 - 2012-01-13  (Read 51193 times)
1MLyg5WVFSMifFjkrZiyGW2nw
July 03, 2011, 07:46:41 PM
 #41

> Hello,
>
> you might want to change this
>
> Code:
> #define sharound(n) { t1 = t1(n); Vals[(131 - n) % 8] += t1(n); Vals[(135 - n) % 8] = t1(n) + t2(n); }
>
> to
>
> Code:
> #define sharound(n) { t1 = t1(n); Vals[(131 - n) % 8] += t1; Vals[(135 - n) % 8] = t1 + t2(n); }
>
> This got me a 25% performance increase!

> Seems good, but it brings no gain for me ... weird. I will look into this and perhaps reuse your idea, if I may :). I can only think of a good compiler optimization...

> It is not obvious that this will gain anything. On the contrary, unless the compiler optimizes it away, it makes the second and third instructions dependent on the first. This is bad on a GPU, which issues 4 or 5 instructions in parallel every clock cycle.

Well, it worked for me, might be because I only have a slow 4670. However, the same change in sharound2 decreases performance.
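
For readers following the macro change, here is a minimal plain-C sketch of the two sharound variants. It is not the kernel itself: the helper names (rotl, t1_of, t2_of, round_recompute, round_cached, variants_agree) and the dummy K and W values are made up for illustration, and t2 uses the standard SHA-256 form, which is not quoted in this thread. Only the Vals[] index arithmetic follows the macros above. It shows that both variants produce the same state; which one is faster depends on the compiler and the card, as the posts above report.

```c
#include <stdint.h>

typedef uint32_t u;

/* Rotate left; stands in for the kernel's rot(). */
static u rotl(u x, unsigned r) { return (x << r) | (x >> (32 - r)); }

static u Ch(u x, u y, u z) { return z ^ (x & (y ^ z)); }
static u Ma(u x, u y, u z) { return Ch(z ^ x, y, x); }

/* t1 for round n, using the same Vals[] indexing as the quoted macros. */
static u t1_of(const u *Vals, const u *K, const u *W, int n) {
    u e = Vals[(132 - n) % 8], f = Vals[(133 - n) % 8], g = Vals[(134 - n) % 8];
    u s1 = rotl(e, 26) ^ rotl(e, 21) ^ rotl(e, 7);
    return K[n % 64] + Vals[(135 - n) % 8] + W[n] + s1 + Ch(e, f, g);
}

/* t2 for round n (standard SHA-256 form; an assumption, not quoted in the thread). */
static u t2_of(const u *Vals, int n) {
    u a = Vals[(128 - n) % 8], b = Vals[(129 - n) % 8], c = Vals[(130 - n) % 8];
    u s0 = rotl(a, 30) ^ rotl(a, 19) ^ rotl(a, 10);
    return s0 + Ma(a, b, c);
}

/* Variant 1: t1 is evaluated twice per round. */
static void round_recompute(u *Vals, const u *K, const u *W, int n) {
    Vals[(131 - n) % 8] += t1_of(Vals, K, W, n);
    Vals[(135 - n) % 8] = t1_of(Vals, K, W, n) + t2_of(Vals, n);
}

/* Variant 2: t1 is evaluated once and reused. */
static void round_cached(u *Vals, const u *K, const u *W, int n) {
    u t1 = t1_of(Vals, K, W, n);
    Vals[(131 - n) % 8] += t1;
    Vals[(135 - n) % 8] = t1 + t2_of(Vals, n);
}

/* Runs 64 rounds of both variants on dummy data; returns 1 if the states match. */
static int variants_agree(void) {
    u K[64], W[64], A[8], B[8];
    for (int i = 0; i < 64; i++) {
        K[i] = 0x9E3779B9u * (u)(i + 1);  /* dummy values, not the real SHA-256 constants */
        W[i] = 0x85EBCA6Bu * (u)(i + 7);  /* dummy message schedule */
    }
    for (int i = 0; i < 8; i++) A[i] = B[i] = 0x01234567u + (u)i;
    for (int n = 0; n < 64; n++) {
        round_recompute(A, K, W, n);
        round_cached(B, K, W, n);
    }
    for (int i = 0; i < 8; i++)
        if (A[i] != B[i]) return 0;
    return 1;
}
```

The recomputed t1 is safe because the first statement only writes Vals[(131 - n) % 8], which t1 never reads. The cached form trades that second evaluation for a dependency on the stored value, which is the trade-off discussed in the quotes above.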

Another thing that seems to run a little bit faster on cards without BFI_INT:
Code:
#define Ma(x, y, z) Ch((z^x), (y), (x))
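
A quick way to convince yourself that this rewrite is still the SHA-256 majority function is a plain-C check like the sketch below. The function names are made up for illustration. Because all operations are bitwise, checking the 8 single-bit input combinations is enough.

```c
#include <stdint.h>

/* SHA-256 choose: per bit, take y where x is 1, otherwise z. */
static uint32_t ch(uint32_t x, uint32_t y, uint32_t z) { return (x & y) | (~x & z); }

/* Textbook majority: a bit is set if at least two inputs have it set. */
static uint32_t ma_textbook(uint32_t x, uint32_t y, uint32_t z) {
    return (x & y) | (x & z) | (y & z);
}

/* Majority rewritten as in the post: Ma(x, y, z) = Ch(z ^ x, y, x). */
static uint32_t ma_via_ch(uint32_t x, uint32_t y, uint32_t z) { return ch(z ^ x, y, x); }

/* Returns 1 if both forms agree on all 8 single-bit input combinations. */
static int ma_forms_agree(void) {
    for (uint32_t bits = 0; bits < 8; bits++) {
        uint32_t x = bits & 1, y = (bits >> 1) & 1, z = (bits >> 2) & 1;
        if ((ma_textbook(x, y, z) & 1) != (ma_via_ch(x, y, z) & 1)) return 0;
    }
    return 1;
}
```

The intuition: where x and z agree, the majority is that shared bit (x); where they differ, y casts the deciding vote.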
Diapolo (OP)
July 03, 2011, 08:07:14 PM
 #42

> It is not obvious that this will gain anything. On the contrary, unless the compiler optimizes it away, it makes the second and third instructions dependent on the first. This is bad on a GPU, which issues 4 or 5 instructions in parallel every clock cycle.

I wish I could use the AMD APP KernelAnalyzer, but it currently won't work, so I rely on the raw MHash/sec numbers that Phoenix gives me :-/. Anyway, I will perhaps try to follow your hint. If it's faster to do a calculation twice so that the instructions are independent of each other, then okay :D.

My work is not over ;).

Thanks to the 3 people who sent a donation so far! It's a nice motivation. Did anyone repost this in the Mining Software forum? I am not allowed to post there ^^.

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
bitless
July 03, 2011, 09:55:19 PM
 #43

Good, but this

> - added: "u t1W" variable, which is used in sharound2() to avoid double execution of t1W()

may actually hurt the performance, in theory. If you're using more registers, at least some GPUs may not be able to run as many threads concurrently as they used to, thus slowing things down.
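
To make the register argument concrete, here is a toy model in C. Everything in it is hypothetical: the function name, the idea of a single shared register file, and the numbers in the usage note are illustrative and not taken from any real GPU.

```c
/* Toy occupancy model with made-up parameters: a compute unit owns
 * `register_file` registers and each thread needs `regs_per_thread`. */
static unsigned resident_threads(unsigned register_file, unsigned regs_per_thread) {
    if (regs_per_thread == 0) return 0;  /* avoid division by zero */
    return register_file / regs_per_thread;
}
```

With a hypothetical file of 16384 registers, going from 32 to 33 registers per thread drops the count from 512 to 496 resident threads. Real hardware allocates registers in coarser units, so the actual effect of one extra variable may be zero or much larger than this.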

plantucha
July 03, 2011, 10:29:58 PM
 #44

> Did anyone repost this in the Mining Software forum? I am not allowed to post there ^^.

Newbie rules are pretty hard here. You have to spend more than 4 hours on this forum before you are able to post anywhere.
Diapolo (OP)
July 04, 2011, 04:32:05 AM
 #45

> Good, but this
>
> > - added: "u t1W" variable, which is used in sharound2() to avoid double execution of t1W()
>
> may actually hurt the performance, in theory. If you're using more registers, at least some GPUs may not be able to run as many threads concurrently as they used to, thus slowing things down.

That one was removed a few hours after I added it, don't worry :). You can safely remove "u t1W;" and replace "t1W = t1W(n);" with "t1 = t1W(n);" in sharound2.

Dia

Alex AXe
July 04, 2011, 05:21:54 AM
 #46

360 -> 362 HD6950@900MHz :)

r4in
July 04, 2011, 09:56:59 AM
 #47

Thanks a lot for this.

303 -> 309 on a Radeon 6870 (1005/350) using Phoenix with your kernel!
xurious
July 04, 2011, 02:02:56 PM
 #48

I was using some patch to get a few extra MH/s yesterday, but I just downloaded this new one and get about 6 more! Badass!

I need to find a way to stop having to implement all these changes across all my machines! :D

Thanks!

Diapolo (OP)
July 04, 2011, 02:06:36 PM
 #49

All those who are happy and gain a few MHash/sec make me proud and happy, too :). Keep posting here!

Dia

erek
July 04, 2011, 04:11:29 PM
 #50

2x 6970s:  755 (old) -> 781 (new)


thanks!
DullJack
July 04, 2011, 06:19:10 PM
 #51

Very nice, will try this on my new rig.
nebiki
July 04, 2011, 06:30:58 PM
 #52

went up from 385 to 398. didn't use the 3% thing before. thanks. now i'll have to look at the stale rates.

Diapolo (OP)
July 06, 2011, 01:08:32 PM
 #53

New version is ready, download here: http://www.mediafire.com/?f8b8q3w5u5p0ln0

I updated the first post with the changelog and performance info. This one should be a bit faster on 69XX cards than the original phatk, faster on 58XX than all other phatk versions I did, and faster on non-BFI_INT cards because of a change user 1MLyg5WVFSMifFjkrZiyGW2nw suggested!

Try, have fun, comment and donate :D.

Thanks,
Dia

dewon
July 06, 2011, 03:20:00 PM
 #54

Went from 322 to 329 with hd 5830
Diapolo (OP)
July 06, 2011, 03:38:11 PM
 #55

> Went from 322 to 329 with hd 5830

Great, it seems like the HD 5830 scales really well with my mod :).

Dia

1MLyg5WVFSMifFjkrZiyGW2nw
July 06, 2011, 05:23:55 PM
Last edit: July 06, 2011, 06:13:15 PM by 1MLyg5WVFSMifFjkrZiyGW2nw
 #56

> New version is ready, DL here: http://www.mediafire.com/?f8b8q3w5u5p0ln0
>
> Updated first post with changelog and performance info. This one should be a bit faster on 69XX cards than the original phatk, faster than all other phatk versions I did on 58XX and faster on non BFI_INT cards because of a change user 1MLyg5WVFSMifFjkrZiyGW2nw suggested!
>
> Try, have fun, comment and donate :D.
>
> Thanks,
> Dia

Thanks, best version yet :D
I still haven't reached the 40 MHash/sec the wiki says my card could do ???

Did you notice that Ma(x, y, z) is now defined exactly the same whether BFI_INT is enabled or not? It seems more elegant to me to move it out of the #ifdef. I also tried to replace some #defines with functions, guessing that it would make it easier for a somewhat smart compiler to find repeatedly used terms and put them into registers. No performance improvement, but it didn't hurt either.

Also, OpenCL has a builtin, bitselect, that can implement Ch. It is not faster for me, but maybe for someone else:
Code:
#define Ch(x, y, z) bitselect(z, y, x)
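
For anyone wondering about the argument order: OpenCL's bitselect(a, b, c) takes each result bit from a where c has a 0 and from b where c has a 1. The plain-C sketch below emulates that for 32-bit scalars and compares it with the textbook Ch. The names bitselect_u32, ch_textbook and ch_bitselect are made up for illustration.

```c
#include <stdint.h>

/* Plain-C stand-in for OpenCL's bitselect(a, b, c) on 32-bit scalars:
 * per bit, the result is a where c is 0 and b where c is 1. */
static uint32_t bitselect_u32(uint32_t a, uint32_t b, uint32_t c) {
    return (a & ~c) | (b & c);
}

/* Textbook SHA-256 choose function. */
static uint32_t ch_textbook(uint32_t x, uint32_t y, uint32_t z) {
    return (x & y) ^ (~x & z);
}

/* Ch expressed through bitselect, with the argument order from the post. */
static uint32_t ch_bitselect(uint32_t x, uint32_t y, uint32_t z) {
    return bitselect_u32(z, y, x);
}
```

So x acts as the selector mask, y is taken where x is set, and z where it is not, which is exactly Ch.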
Diapolo (OP)
July 06, 2011, 07:30:15 PM
 #57

> Thanks, best version yet :D
> Still not reached the 40 MHash/sec the wiki says my card could do ???
>
> Did you notice that Ma(x, y, z) is defined exactly the same now whether BFI_INT is enabled or not? Seems more elegant to me if moved out of the #ifdef. Also I tried to replace some #define's with functions, guessing that it would make it easier for a somewhat smart compiler to find repeatedly used terms and put them into registers. No performance improvement, but didn't hurt it either.
>
> Also, OpenCL has a builtin Ch function, not faster for me but maybe for someone else:
> #define Ch(x, y, z) bitselect(z, y, x)

Thank YOU :) Another nice hint; even if it doesn't boost performance, the code gets cleaner. I'm not sure about the #defines as functions. Could you post an example? My problem is that most variables are declared and defined inside the kernel function. So a function for sharound(), for example, needs Vals[] and others passed as parameters (by copy or pointer).

Dia

1MLyg5WVFSMifFjkrZiyGW2nw
July 06, 2011, 08:23:47 PM
 #58

> Thank YOU :) another nice hint, even if it not boosts, the code gets cleaner. I'm not sure about the #define as functions. Could you post an example? My problem is, that most variables are defined and declared inside the kernel function. So a function for sharound() for example needs Vals[] and others as passed parameters (copy or pointer).
>
> Dia

Yes, I haven't changed those defines yet; I mostly added my own intermediate functions:

Code:
// Ma can also be implemented in terms of Ch...
u Ma(u x, u y, u z) { return Ch(z^x, y, x); }

// Various intermediate calculations for each SHA round

u xrot2(u n, const uint r1, const uint r2) {
        return rot(n, r1) ^ rot(n, r2);
}

u xrot3(u n, const uint r1, const uint r2, const uint r3) {
        return xrot2(n, r1, r2) ^ rot(n, r3);
}

u xrrs(u n, const uint r1, const uint r2, const uint r3) {
        return xrot2(n, r1, r2) ^ (n >> r3);
}

#define s0(n) xrot3(Vals[(128-n) % 8], 30, 19, 10)
#define s1(n) xrot3(Vals[(132-n) % 8], 26, 21, 7)
#define ch(n) Ch(Vals[(132 - n) % 8], Vals[(133 - n) % 8], Vals[(134 - n) % 8])
#define ma(n) Ma(Vals[(129 - n) % 8], Vals[(130 - n) % 8], Vals[(128 - n) % 8])
#define t1(n) (K[n % 64] + Vals[(135 - n) % 8] + W[n] + s1(n) + ch(n))

// intermediate W calculations
#define P1(x) xrrs(W[x - 2], 15, 13, 10)
#define P2(x) xrrs(W[x - 15], 25, 14, 3)

Since there is no noticeable drop in hashrate, I assume the compiler is inlining these functions.
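
As a sanity check on the rotation constants, the helpers can be compared against the SHA-256 functions as FIPS 180-4 writes them, with right rotations. The sketch below is plain scalar C, so it assumes u is a 32-bit unsigned integer (in the kernel it may be a vector type) and that rot() rotates left; rotr, the four spec functions and helpers_match_spec are names added for illustration.

```c
#include <stdint.h>

typedef uint32_t u;  /* scalar stand-in; the kernel's u may be a vector type */

static inline u rot(u x, unsigned r)  { return (x << r) | (x >> (32 - r)); }  /* rotate left */
static inline u rotr(u x, unsigned r) { return (x >> r) | (x << (32 - r)); }  /* rotate right */

/* Helpers as posted above. */
static inline u xrot2(u n, unsigned r1, unsigned r2) { return rot(n, r1) ^ rot(n, r2); }
static inline u xrot3(u n, unsigned r1, unsigned r2, unsigned r3) { return xrot2(n, r1, r2) ^ rot(n, r3); }
static inline u xrrs(u n, unsigned r1, unsigned r2, unsigned r3) { return xrot2(n, r1, r2) ^ (n >> r3); }

/* SHA-256 functions written with right rotations, as in FIPS 180-4. */
static u Sigma0(u x) { return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22); }
static u Sigma1(u x) { return rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25); }
static u sigma0(u x) { return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3); }
static u sigma1(u x) { return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10); }

/* Returns 1 if the posted constants reproduce all four spec functions for x. */
static int helpers_match_spec(u x) {
    return xrot3(x, 30, 19, 10) == Sigma0(x)   /* s0 */
        && xrot3(x, 26, 21, 7)  == Sigma1(x)   /* s1 */
        && xrrs(x, 25, 14, 3)   == sigma0(x)   /* P2 */
        && xrrs(x, 15, 13, 10)  == sigma1(x);  /* P1 */
}
```

Each left-rotation count is simply 32 minus the right-rotation count from the specification, which is why 30, 19, 10 corresponds to 2, 13, 22.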

Also, you can eliminate one extra assignment to Vals[4]:

Code:
//Vals[4] = PreVal4;
//...
#ifdef VECTORS
        Vals[4] = (W[3] = ((base + get_global_id(0)) << 1) + (uint2)(0, 1)) + PreVal4;
#else
        Vals[4] = (W[3] = base + get_global_id(0)) + PreVal4;
#endif
//...
//Vals[4] += W[3];
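
The folding works because an assignment in C (and OpenCL C) is an expression whose value is the value just stored. A minimal scalar sketch, with gid standing in for get_global_id(0) and both function names made up for illustration:

```c
#include <stdint.h>

/* Original sequence: set Vals[4], store the nonce in W[3], then add it. */
static uint32_t vals4_three_steps(uint32_t pre_val4, uint32_t base, uint32_t gid, uint32_t *w3) {
    uint32_t vals4 = pre_val4;
    *w3 = base + gid;
    vals4 += *w3;
    return vals4;
}

/* Folded form from the post: a single assignment that reuses the value of (W[3] = ...). */
static uint32_t vals4_folded(uint32_t pre_val4, uint32_t base, uint32_t gid, uint32_t *w3) {
    uint32_t vals4 = (*w3 = base + gid) + pre_val4;
    return vals4;
}
```

Both versions leave the same values in W[3] and Vals[4]. Whether the folded form saves an instruction depends on the compiler, which may already merge the separate statements.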
Diapolo (OP)
July 06, 2011, 09:01:00 PM
 #59

I really like this "let's do better" game :). But for now I say good night!

Dia

conspirosphere.tk
July 06, 2011, 11:28:17 PM
 #60

I don't have a benchmark, but according to the Phoenix miner I went from about 160 MH/s to 180+ using your patch with my 5770.
Good work!
Many thanks