Author Topic: DiaKGCN kernel for CGMINER + Phoenix 2 (79XX / 78XX / 77XX / GCN) - 2012-05-25  (Read 27713 times)
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
January 27, 2012, 06:56:42 PM
Last edit: September 20, 2012, 05:15:46 PM by Diapolo
 #1

DiaKGCN is a work-in-progress, GCN-optimised mining kernel for CGMINER and Phoenix 2. It has eaten weeks of hard work and trial and error so far. It will run just fine on VLIW4 and VLIW5 GPUs, but it is not optimised for them.

As the kernel is now part of CGMINER since version 2.2.7, there is no need to download additional files, you can use it out of the box. I will supply an updated kernel package for Phoenix 2, when the final version is available!

I'd like to get feedback, performance results and ideas to optimise it even further!
To support the further development of this kernel please donate to: 1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x (0.94 BTC donated so far, thanks!)

Diapolo



CGMINER thread with download links and documentation:
https://bitcointalk.org/index.php?topic=28402.0

DiaKGCN - Phoenix 2 download history:
https://anonfiles.com/file/a88219997407050d4b2ec153b35b2c0a
http://www.filedropper.com/diakgcnphoenix2
http://www.filedropper.com/diakgcnphoenix2preview_1

DiaKGCN - Phoenix 1 download history (just for reference):
http://www.filedropper.com/diakgcn04-02-2012
http://www.filedropper.com/diakgcn03-02-2012_1
http://www.filedropper.com/diakgcn02-02-2012
http://www.filedropper.com/diakgcn29-01-2012
http://www.filedropper.com/diakgcn28-01-2012



instructions for CGMINER

To use the current optimal settings on 79XX cards, add these parameters to your CGMINER command line:
Code:
-k diakgcn -v 2 -w 256

You need CGMINER >= 2.2.7 to be able to use diakgcn!



instructions for Phoenix 2

Place the folder diakgcn in phoenix2\plugins and use this for your config-file on 79XX cards (here it's for platform and device 0):
Code:
[cl:0:0]
kernel = diakgcn
aggression = 12
goffset = true
vectors2 = true
vectors4 = false
vectors8 = false
worksize = 256

For VLIW4 / VLIW5 you should use:
Code:
[cl:0:0]
kernel = diakgcn
aggression = 12
goffset = true
vectors2 = false
vectors4 = false
vectors8 = true
worksize = 128

With the current Phoenix 2 version, don't run a single instance across mixed GCN and VLIW4 / VLIW5 GPUs, as this leads to very poor performance!



instructions for Phoenix 1

Place the folder diakgcn in phoenix\kernels and use this command line on 79XX cards:
Code:
-k diakgcn AGGRESSION=12 VECTORS2 WORKSIZE=256

For VLIW4 / VLIW5 you should use:
Code:
-k diakgcn AGGRESSION=12 VECTORS4 WORKSIZE=128
or
Code:
-k diakgcn AGGRESSION=12 VECTORS8 WORKSIZE=128

If you encounter high CPU usage with multiple cards, try pinning each Phoenix instance to a single CPU core (set a CPU affinity)!



DiaKGCN parameter description for Phoenix

BFI_INT
Use BFI_INT instruction patching (default is true).

GOFFSET
Use OpenCL 1.1 global offset parameter (default is true).

VECTORS2
Enable uint2 vector support in the kernel (default is false).

VECTORS4
Enable uint4 vector support in the kernel (default is false).

VECTORS8
Enable uint8 vector support in the kernel (default is false).
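To make the VECTORS2 / VECTORS4 / VECTORS8 switches above concrete: they trade work-items for per-item work. The sketch below is our own illustration of the mapping, not actual DiaKGCN source; the function name and layout are assumptions.

```c
#include <stdint.h>

/* Illustration only -- names and layout are our assumption, not the
 * DiaKGCN source.  With VECTORS2/4/8 each work-item hashes a
 * uint2/uint4/uint8 of nonces instead of a single scalar, so a
 * work-item with global id `gid` and vector width `width` covers the
 * contiguous nonces gid*width .. gid*width + (width - 1). */
static uint32_t lane_nonce(uint32_t gid, uint32_t width, uint32_t lane)
{
    return gid * width + lane;
}
```

Adjacent work-items thus cover disjoint, contiguous nonce ranges; doubling the vector width halves the number of work-items needed to sweep the same nonce space.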



BFI_INT patching whitelist (only VLIW4 / VLIW5 GPUs)

Barts
BeaverCreek
Caicos
Cayman
Cedar
Cypress
Devastator
Juniper
Loveland
Redwood
Scrapper
Turks
WinterPark



changelog 04-02-2012:
- added uint8 vectors support in the kernel and the init (use VECTORS8 switch to activate it)
- added GOFFSET switch to be able to disable global offset parameter (use GOFFSET=False to disable it)
  -> perhaps GOFFSET is slower for some, now you can try the alternative
- changed some kernel parameter descriptions
- removed unused VECTORS3 code, never got it working :-/
- renamed OpenCL11 flag to hasOpenCL11 in the init
- removed some unneeded references to phatk from the init
- added a few comments in the init
- upped init revision to 127

changelog 03-02-2012:
- fixed the VECTORS4 code-path, which is now usable again
  -> VECTORS4 should be beneficial for VLIW4 / VLIW5, but not for GCN
- removed the (u) typecasts in the non BFI_INT Ch() and Ma() versions
  -> the hex values that are used directly in Ch() or Ma() were changed to be unsigned
- added 2 different Ma() versions, one for VECTORS2 or VECTORS4 defined (was in before), the other for the scalar version of the kernel (new)
  -> new scalar version saves 4 Bytes in compiled GPU ISA code (but VECTORS2 is still fastest for GCN)
- hardened the BFI_INT auto patching code in the init
  -> a whitelisted OpenCL device is now checked for cl_amd_media_ops extension
- fixed a small bug where I tried to use the C-operator "&" as a "logical and" in the init
  -> changed it into a Python "and" ^^
- removed a few lines unused code from the init
- upped init revision to 126

changelog 02-02-2012:
- added an automatic usage of the OpenCL 1.1 global offset parameter, on OpenCL >= 1.1 platforms -> Thanks DiabloD3 for the idea
- removed both __constant arrays in the kernel, values are now used directly
- changed Ma() function from a general one into faster ones for the BFI_INT path and the non BFI_INT path
- added new kernel parameters (W16addK16, W17addK17, state0A and state0B)
- added 2 new local variables state0AaddV0 and state0BaddV0
- rewrote some rounds to use new kernel parameters and variables for faster execution
- fixed a write to output buffer bug for the non VECTORS path in the kernel
- changed the BFI_INT whitelisted flag code in the init
- added an OpenCL >= 1.1 flag in the init used for activating the global offset parameter
- reactivated PyOpenCL version output in the init
- upped init revision to 125
- removed unneeded code or comments from the kernel and the init
- added the DiabloMiner kernel as an additional reference for getting new ideas in the kernel header

changelog 29-01-2012:
- reordered kernel parameters in order of usage in the kernel
- removed unused kernel parameters (B1addF1addK6, C1addG1addK5, D1addH1)
- added new kernel parameter (PreVal0addK7)
- rewrote first 4 rounds to speed up the kernel
- VECTORS4 parameter is not finished, it currently uses VECTORS2 code-path

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
Roadhog2k5
Full Member
***
Offline Offline

Activity: 131
Merit: 100



View Profile
January 27, 2012, 07:55:09 PM
 #2

I have 3, 7970s I'd be willing to test on. Shoot me a pm.
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
January 27, 2012, 10:13:16 PM
 #3

I have 3, 7970s I'd be willing to test on. Shoot me a pm.

Done, thanks for helping Smiley.

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
sveetsnelda
Hero Member
*****
Offline Offline

Activity: 642
Merit: 500


View Profile
January 27, 2012, 11:05:39 PM
 #4

Same story.  Have a 4 card rig and would be glad to help.

14u2rp4AqFtN5jkwK944nn741FnfF714m7
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
January 27, 2012, 11:09:54 PM
 #5

Same story.  Have a 4 card rig and would be glad to help.

PM sent, thanks!

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
simplecoin
Sr. Member
****
Offline Offline

Activity: 406
Merit: 250



View Profile WWW
January 28, 2012, 12:20:38 AM
 #6

got a 1 card rig if you need it.

Donations: 1VjGJHPtLodwCFBDWsHJMdEhqRcRKdBQk
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
January 28, 2012, 12:28:57 AM
 #7

If all keeps this smooth, a release is just around the corner ... stay tuned.

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
simplecoin
Sr. Member
****
Offline Offline

Activity: 406
Merit: 250



View Profile WWW
January 28, 2012, 01:02:02 AM
 #8

Nice work for sure! The more 7970 kernels the better Smiley

Donations: 1VjGJHPtLodwCFBDWsHJMdEhqRcRKdBQk
jjiimm_64
Legendary
*
Offline Offline

Activity: 1876
Merit: 1000


View Profile
January 28, 2012, 04:21:15 AM
 #9


I have a 4x7970 rig.  would love to test.

1jimbitm6hAKTjKX4qurCNQubbnk2YsFw
wndrbr3d
Hero Member
*****
Offline Offline

Activity: 914
Merit: 500


View Profile
January 28, 2012, 05:00:43 AM
 #10

Totes subbing to this thread. I have the money, just waiting for the results Smiley
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
January 28, 2012, 04:39:26 PM
 #11

A second version was sent to the testers, if others are interested in trying this out just give me a shout.

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
January 28, 2012, 10:29:41 PM
 #12

http://www.filedropper.com/diakgcn28-01-2012

I'll leave this without comments for now ...

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
simplecoin
Sr. Member
****
Offline Offline

Activity: 406
Merit: 250



View Profile WWW
January 28, 2012, 11:51:32 PM
Last edit: January 29, 2012, 05:27:05 AM by simplecoin
 #13

Is the hashrate display broken for VECTORS4? Running VEC2/AGG10/WS256 I get ~626MH at 1080/366 (about 10 mh/s less than diablo, not bad!).

If I use VEC4 my hashrate display doubles - ~1.22GH/s. I wish this wasn't a bug or something Shocked

Yes, I see this too at stock (1.09Gh v4 agg12). Although, shares are accepted..... gonna wait to see what my site says actual shares are

UPDATE: Actual Hashrate is about the same as vectors2. Seems like a reporting issue.

Donations: 1VjGJHPtLodwCFBDWsHJMdEhqRcRKdBQk
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
January 29, 2012, 10:16:47 AM
 #14

Is the hashrate display broken for VECTORS4? Running VEC2/AGG10/WS256 I get ~626MH at 1080/366 (about 10 mh/s less than diablo, not bad!).

If I use VEC4 my hashrate display doubles - ~1.22GH/s. I wish this wasn't a bug or something Shocked

VEC4 is broken, sorry to say Wink ... it works with VEC2 speed currently. VEC4 seems to be not a good option for GCN.
I will polish the kernel further and supply a changelog in the future. I only wanted to get it released first.

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
January 29, 2012, 04:18:09 PM
 #15

download current version:
http://www.filedropper.com/diakgcn29-01-2012

Should be faster than the previous one, changelog is included and I edited the first post to be more informative!

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
Dyaheon
Member
**
Offline Offline

Activity: 121
Merit: 10


View Profile
January 30, 2012, 02:22:55 PM
 #16

~695MH/s on a 7970 at 1175/1375 clocks, with the command line from the OP.

Diablominer gives ~700MH/s with less interface lag though.
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
January 30, 2012, 02:32:01 PM
 #17

~695MH/s on a 7970 at 1175/1375 clocks, with the command line from the OP.

Diablominer gives ~700MH/s with less interface lag though.

Some reports indicate that a lower AGGRESSION could lead to higher rates, but I can't confirm this on my machine.
I'm working hard on the next version; the optimisation is not finished ...

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
wndrbr3d
Hero Member
*****
Offline Offline

Activity: 914
Merit: 500


View Profile
February 01, 2012, 07:48:43 PM
 #18

@Diapolo:

So do you have any opinions on GCN vs. VLIW4/5 when it comes to optimizations for the mining cores that are out there? Do you expect GCN to be a nice step forward, or, at best, should we be happy that GCN didn't nerf performance compared to the VLIW4/5 architecture?

I'm curious to get your feedback. Smiley

Thanks for all your work!
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 02, 2012, 12:21:38 PM
Last edit: February 03, 2012, 06:48:41 AM by Diapolo
 #19

New version 02-02-2012 is ready for download. Release highlights include OpenCL 1.1 global offset parameter support (THX DiabloD3 for the idea - damn, it sucked to do this in Python ^^), a fixed non-VECTOR code path and faster kernel execution on GCN cards (achieved by saving instructions in the GPU ISA code).

download current version:
http://www.filedropper.com/diakgcn02-02-2012

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 02, 2012, 12:27:43 PM
 #20

@Diapolo:

So do you have any opinions on GCN vs. VLIW4/5 when it comes to optimizations for the mining cores that are out there? Do you expect GCN to be a nice step forward, or, at best, should we be happy that GCN didn't nerf performance compared to the VLIW4/5 architecture?

I'm curious to get your feedback. Smiley

Thanks for all your work!

I think GCN is a great step in the right direction. It's far easier for me AND the compiler to write / generate code, which results in pretty good utilization of the GPU's compute units. In contrast to VLIW4/VLIW5 units, the CUs consist of independent vector units, which makes code or wavefronts on the GPU depend less on results from other units. The OpenCL compiler for GCN feels far more mature than it was after the release of the 69XX series of cards. The drawback seems to be that the current kernels all have very similar performance levels Cheesy.

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 02, 2012, 07:30:30 PM
 #21

Okay, I've downloaded the kernel and am trying it now.  So far, not bad.  The Vectors4 still has the whole issue with showing twice as many hashes as are actually computing, but I think that has to do with the init file as you said there were incompatibilities in the code when using the VECTORS4 option.  Also, why the (u) variable when using bitselect?
I like how you used the nonce here.  It seems that it could be better than using a series of if-else statements.
You've managed to keep the instructions low, but somehow the darn thing's not hashing faster.  Probably because it's not repeating the same task again and again with the same variables.  But, as you said, it's optimized for GCN, so I have no idea.

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs.  If you need it, we can get it.  We have solutions for your computing conundrums.  BTC accepted!  12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 02, 2012, 07:50:46 PM
 #22

Okay, I've downloaded the kernel and am trying it now.  So far, not bad.  The Vectors4 still has the whole issue with showing twice as many hashes as are actually computing, but I think that has to do with the init file as you said there were incompatibilities in the code when using the VECTORS4 option.  Also, why the (u) variable when using bitselect?
I like how you used the nonce here.  It seems that it could be better than using a series of if-else statements.
You've managed to keep the instructions low, but somehow the darn thing's not hashing faster.  Probably because it's not repeating the same task again and again with the same variables.  But, as you said, it's optimized for GCN, so I have no idea.

VEC4 is bugged until I say it got fixed, sorry Cheesy. The (u) is a typecast, because after round 64 I use some mixed scalar and vector values and the cast is needed to make them match.
For me this is the fastest version on my 7970 ... but it seems no one cares to try it (on GCN cards).

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
sveetsnelda
Hero Member
*****
Offline Offline

Activity: 642
Merit: 500


View Profile
February 02, 2012, 10:42:27 PM
 #23

I tried it, and I'm sorry that I haven't reported back.  Work has been chaotic.

The only way that I can get a similar hashrate compared to DiabloMiner with this kernel is to use a very high intensity (greater than 10).  By doing this though, CPU usage skyrockets and I burn up more wattage than the hashrate increase is worth.  I can make a few changes to the Poclbm kernel included with CGMiner though and get 96 percent of the performance of DiabloMiner while leaving the intensity at 9.  By using CGMiner, I am able to use backup pools, RPC, thermal controls, etc, etc.  This more than makes up for the ~4 percent loss in performance.  I'm not at home right now to look at every change, but defining the Ch and Ma functions to use bitselect is basically all that was needed.
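For context, the bitselect trick works because SHA-256's choose and majority functions each collapse into a single bitselect(), which BFI_INT patching then maps to one BFI_INT instruction on VLIW4 / VLIW5 GPUs. A plain-C sketch of the identities (our emulation of OpenCL's bitselect semantics, not code from either kernel):

```c
#include <stdint.h>

/* Emulation of OpenCL's bitselect(a, b, c): takes bits of b where c
 * is 1 and bits of a where c is 0. */
static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* Textbook SHA-256 definitions, for reference. */
static uint32_t Ch(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (~x & z);
}
static uint32_t Ma(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (z & (x | y));
}

/* The bitselect forms: Ch(x,y,z) == bitselect(z, y, x)
 *                      Ma(x,y,z) == bitselect(x, y, z ^ x)   */
static uint32_t Ch_bfi(uint32_t x, uint32_t y, uint32_t z)
{
    return bitselect32(z, y, x);
}
static uint32_t Ma_bfi(uint32_t x, uint32_t y, uint32_t z)
{
    return bitselect32(x, y, z ^ x);
}
```

Each form replaces three or four bitwise operations with one selectable-bit instruction, which is where the speedup on AMD hardware comes from.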

I'll try to send you a PM tonight with more details.

14u2rp4AqFtN5jkwK944nn741FnfF714m7
blissfulyoshi
Newbie
*
Offline Offline

Activity: 11
Merit: 0


View Profile
February 03, 2012, 12:41:43 AM
 #24

What I was asking about earlier in the previous thread is why the name changed in this version:

current: diaggcn
thread title/previous: diakgcn

Oh well, minor thing, just changed the name of my inputs into phoenix. Keep up the good work.
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 03, 2012, 06:43:32 AM
 #25

What I was asking about earlier in the previous thread is why the name changed in this version:

current: diaggcn
thread title/previous: diakgcn

Oh well, minor thing, just changed the name of my inputs into phoenix. Keep up the good work.

ROFL ... I made a typo, wow, that is embarrassing. Will upload a fixed one ASAP Cheesy. Sorry for the confusion, yesterday was a bit hard Cheesy.

Update: Fixed my typo ;-), download is back!

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 03, 2012, 06:59:56 AM
 #26

I tried it, and I'm sorry that I haven't reported back.  Work has been chaotic.

The only way that I can get a similar hashrate compared to DiabloMiner with this kernel is to use a very high intensity (greater than 10).  By doing this though, CPU usage skyrockets and I burn up more wattage than the hashrate increase is worth.  I can make a few changes to the Poclbm kernel included with CGMiner though and get 96 percent of the performance of DiabloMiner while leaving the intensity at 9.  By using CGMiner, I am able to use backup pools, RPC, thermal controls, etc, etc.  This more than makes up for the ~4 percent loss in performance.  I'm not at home right now to look at every change, but defining the Ch and Ma functions to use bitselect is basically all that was needed.

I'll try to send you a PM tonight with more details.

I would really like to port this one to CGMiner (or help getting it ported), but I haven't had the time to do so AND I guess I need help with doing commits for CGMiner. I will send a PM to Con, perhaps he is interested ...

By the way, I use AGGRESSION=12 with this kernel and get ~75% utilization on 1 core. Not good, but could be worse Cheesy!

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 03, 2012, 10:43:52 AM
 #27

download current version:
http://www.filedropper.com/diakgcn03-02-2012_1

This release fixes the bugged VECTORS4 code, which works again (tested on a 7970 and a 6550D) and could speed things up for VLIW4 / VLIW5 GPUs with WORKSIZE=128, just try it. There are no further changes for GCN in conjunction with VECTORS2 since 03-02-2012.

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
blissfulyoshi
Newbie
*
Offline Offline

Activity: 11
Merit: 0


View Profile
February 03, 2012, 04:40:39 PM
 #28

More testing!!!!!

2.5
DiakGCN VECTORS WORKSIZE=128: 247MHps
DiakGCN VECTORS2 WORKSIZE=128: 280-281MHps
DiakGCN VECTORS4 WORKSIZE=128: 284MHps .... It looks like your old Phoenix kernel is finally beaten for me. Now I just need to surpass my cgminer score of 290MHps xD

CPU at 25-30% on my C2D.

Increases all across the board. Congratz!
d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 04, 2012, 02:12:41 PM
 #29

Just tried your newer kernel and just about crapped myself.  I'm seeing some very competitive numbers with phatk2 now and I love the verbosity.  I see you decided to move the (u) values from the bitselect.  Did that help to speed things along?  I figured that if BFI_INT didn't have them, there was a major difference in something and one of them had to be slower.
I like how you used the global offset to your advantage.  (GOFFSET)
I am impressed.  You've been busy and I can see why.  If I was capable of hashing faster, I would totally send you some coin for your efforts.  Given a few months, I should.

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs.  If you need it, we can get it.  We have solutions for your computing conundrums.  BTC accepted!  12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq
d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 04, 2012, 02:36:03 PM
 #30

Also, if I may, it doesn't look like uu needs to be set for GOFFSET as base doesn't appear to even be used.  I'm guessing that was your intention in the first place.

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs.  If you need it, we can get it.  We have solutions for your computing conundrums.  BTC accepted!  12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq
TurdHurdur
Full Member
***
Offline Offline

Activity: 216
Merit: 100


View Profile
February 04, 2012, 04:32:12 PM
 #31

I get ~10Mhash/s more on my 5870 using:

Code:
VECTORS4 AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP

with this kernel for regular desktop usage.
Compared to my old phatk2 line:

Code:
VECTORS AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP


Though, this kernel doesn't seem to help my higher-aggression-set card(also a 5870) in my crossfire setup compared to your 2011-12-21 phatk_dia.
d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 04, 2012, 04:52:20 PM
 #32

I get ~10Mhash/s more on my 5870 using:

Code:
VECTORS4 AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP

with this kernel for regular desktop usage.
Compared to my old phatk2 line:

Code:
VECTORS AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP


Though, this kernel doesn't seem to help my higher-aggression-set card(also a 5870) in my crossfire setup compared to your 2011-12-21 phatk_dia.
What did you get apples to apples?  As in using VECTORS4 with phatk2?

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs.  If you need it, we can get it.  We have solutions for your computing conundrums.  BTC accepted!  12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq
TurdHurdur
Full Member
***
Offline Offline

Activity: 216
Merit: 100


View Profile
February 04, 2012, 05:36:07 PM
 #33

What did you get apples to apples?  As in using VECTORS4 with phatk2?

Oh, crap, significantly more. Guess I should've tried phatk2 with VECTORS4 before...

Edit: Blah, ignore above postings. I've been editing the kernel files.
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 04, 2012, 05:57:04 PM
 #34

download current version:
http://www.filedropper.com/diakgcn04-02-2012

This version features uint8 vector support, which is activated via the VECTORS8 switch. This was beneficial on my VLIW5 6550D, but is pretty slow on GCN. Another switch, GOFFSET, was added, which can be used to disable the automatic usage of the global offset parameter (use GOFFSET=false to disable global offset). Perhaps it's faster for some to use the old way of generating the nonces in the kernel, so play around with it Smiley.

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 04, 2012, 05:57:43 PM
 #35

What did you get apples to apples?  As in using VECTORS4 with phatk2?

Oh, crap, significantly more. Guess I should've tried phatk2 with VECTORS4 before...

Edit: Blah, ignore above postings. I've been editing the kernel files.

Would you mind trying the VECTORS8 version and reporting back?

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
TurdHurdur
Full Member
***
Offline Offline

Activity: 216
Merit: 100


View Profile
February 04, 2012, 06:45:40 PM
 #36

Would you mind trying the VECTORS8 version and reporting back?

Dia

I'm using Catalyst 12.1, 875/1225 clocks, same manufacturer/model 5870s on Windows 7.

https://bitcointalk.org/index.php?topic=6458.msg718648#msg718648 kernel:
Code:
VECTORS4 FASTLOOP=false AGGRESSION=10 WORKSIZE=128 BFI_INT

Max: ~400Mhash/s

Code:
VECTORS4 AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP

Max: ~390Mhash/s


Your new diakgcn kernel:

Code:
VECTORS8 FASTLOOP=false AGGRESSION=10 WORKSIZE=128 BFI_INT

Max: ~354Mhash/s

Code:
VECTORS8 AGGRESSION=6 WORKSIZE=128 BFI_INT FASTLOOP

Max: ~352Mhash/s
blissfulyoshi
Newbie
*
Offline Offline

Activity: 11
Merit: 0


View Profile
February 04, 2012, 06:56:07 PM
 #37

I think that posting style looks nice, I'll copy it.

2.5 (6870 on 11.11)
Code:
VECTORS4 DEVICE=0 BFI_INT AGGRESSION=12 WORKSIZE=128
Average: 284Mhash/s

Code:
VECTORS8 DEVICE=0 BFI_INT AGGRESSION=12 WORKSIZE=128
Average: 284Mhash/s

2.6 (6870 on 11.12, 50MHz slower on the GPU clock than the one on 2.5)
Code:
VECTORS2 DEVICE=0 BFI_INT AGGRESSION=12 WORKSIZE=128
Average: 262Mhash/s

Code:
VECTORS4 DEVICE=0 BFI_INT AGGRESSION=12 WORKSIZE=128
Average: 275Mhash/s

Code:
VECTORS8 DEVICE=0 BFI_INT AGGRESSION=12 WORKSIZE=128
Average: 268Mhash/s
d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 05, 2012, 03:39:41 AM
 #38

VECTORS4 WORKSIZE=128 with GOFFSET=false 14.45 Mhash/s
VECTORS4 WORKSIZE=128 without GOFFSET=false 14.46 Mhash/s
VECTORS8 WORKSIZE=128 with GOFFSET=false 14.46 Mhash/s
VECTORS8 WORKSIZE=128 without GOFFSET=false 14.47 Mhash/s

VECTORS4 WORKSIZE=64 with GOFFSET=false 14.49 Mhash/s
VECTORS4 WORKSIZE=64 without GOFFSET=false 14.50 Mhash/s
VECTORS8 WORKSIZE=64 with GOFFSET=false 14.55 Mhash/s
VECTORS8 WORKSIZE=64 without GOFFSET=false 14.50 Mhash/s

VECTORS4 WORKSIZE=32 with GOFFSET=false 14.46 Mhash/s
VECTORS4 WORKSIZE=32 without GOFFSET=false 14.47 Mhash/s
VECTORS8 WORKSIZE=32 with GOFFSET=false 14.50 Mhash/s
VECTORS8 WORKSIZE=32 without GOFFSET=false 14.48 Mhash/s

*High fives*  Playing around with VECTORS8 has done some good.  ^_^  And hardly anyone believed me that using 256-bit integers could pay off.
I'm going to "try" to do something with the nonce code in phatk2 by copying the nonce code from your kernel and see what happens.  I really wouldn't have known how to do it without you.
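Since GOFFSET keeps coming up in these benchmarks: the switch hinges on OpenCL 1.1's global work offset, which lets the runtime fold the nonce base into get_global_id(0) so the kernel no longer needs a base argument and an extra add. A hypothetical plain-C sketch of the two code paths (function names are ours, not DiaKGCN's):

```c
#include <stdint.h>

/* Hypothetical illustration of the GOFFSET idea.  Without a global
 * offset, the host passes a `base` nonce and the kernel adds it to
 * its global id itself. */
static uint32_t nonce_without_goffset(uint32_t global_id, uint32_t base)
{
    return base + global_id;      /* kernel does the add per work-item */
}

/* With OpenCL 1.1's global_work_offset, the runtime has already
 * folded `base` into the id, so the kernel uses it directly and
 * saves both the add and the kernel argument. */
static uint32_t nonce_with_goffset(uint32_t global_id_with_offset)
{
    return global_id_with_offset; /* offset applied by the runtime */
}
```

Both paths produce identical nonces; the benchmarks above suggest the difference on these cards is within measurement noise, which is why the switch is worth toggling per setup.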

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs.  If you need it, we can get it.  We have solutions for your computing conundrums.  BTC accepted!  12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq
d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 05, 2012, 09:17:36 AM
 #39

Just so you know, the ATI cards are capable of handling vectors of up to 16 elements, as far as I'm aware.  I'm not going to try this right now, but it'll supposedly cut down on the amount of work that needs to be done.  Higher-end cards will, of course, see better results than lower-end ones.  I don't know what the physical computing size is for the data, but the hardware will handle uint16, which should be best for dedicated rigs as long as the worksize is dropped to about half of the hardware's limit, from what I see here.

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs.  If you need it, we can get it.  We have solutions for your computing conundrums.  BTC accepted!  12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 05, 2012, 04:44:38 PM
 #40

Just so you know, the ATI cards are capable of handling up to 16 vectors that I'm aware of.  I'm not going to try this right now, but it'll supposedly cut-down on the amount of work that's required to be done.  Higher-end cards will, of course, see better results than lower-end ones.  I don't know what the physical computing size is for the data, but it'll handle int16 which should be best for dedicated rigs as long as the worksize is dropped to about half of the hardware's limit from what I see here.

I could implement uint16, it should be pretty straightforward, but massive vectorisation is really something GCN does not like currently.

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 06, 2012, 04:59:45 AM
 #41

Just so you know, the ATI cards are capable of handling vectors of up to 16 elements, as far as I'm aware.  I'm not going to try this right now, but it'll supposedly cut down on the amount of work that needs to be done.  Higher-end cards will, of course, see better results than lower-end ones.  I don't know what the physical computing size is for the data, but the hardware will handle uint16, which should be best for dedicated rigs as long as the worksize is dropped to about half of the hardware's limit, from what I see here.

I could implement uint16, which should be pretty straightforward, but massive vectorisation is really something GCN currently does not like.

Dia
http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/3
I think it might be because the full 16 vectors are loaded and unloaded to make room for anything else that needs to be computed.   Undecided
In theory, 16-wide vectors are the best approach, but only if we're doing math on just those 16 components, as that's the maximum the ALUs can hold.
In other words, the moment something else needs to be loaded, the entire 512-bit integer has to be pulled from the ALUs into the cache, the new data loaded and computed, unloaded, and then the 512-bit integers reloaded.  But GCN is supposedly a true 16-wide vector design, so I think the problem is the overhead created by all that loading and unloading.  With 8 vectors, did you try a worksize of 64 to see if it was any faster?

PS: Bad news: there's a new Phoenix 2 miner out.  I've suggested they make changes to the phatk2 kernel like the ones you've made for your GCN kernel here, such as adding the GOFFSET option and increasing the vector sizes.

d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 06, 2012, 07:34:18 AM
Last edit: February 06, 2012, 09:02:30 AM by d3m0n1q_733rz
 #42

Anyhow, going back to what I was saying, Dia, I think the best kernel design for GCN is one which computes four 512-bit integers.  Since it can compute one in 4 cycles or 4 in 4 cycles, it seems best to compute 4 sets of 16-wide vectors to use the ALUs to their fullest extent.  Alternatively, you could compute 3 sets and leave the last SIMD for other work required by the kernel, such as the nonce handling.  So multi-threading comes into play with the GCN processors.   Cool
The problem is that these aren't multi-GPUs, they're multi-SIMD GPUs, which makes coding a little trickier.
I might be a little over-zealous in thinking these are capable of handling four times the amount of mining at one time, but it seems like the approach to take.
The biggest suggestion I can make, though, is to drop the worksize to allow for the increased vectors.  You should see some improvement with VECTORS8, but I can't promise it.
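For anyone who wants to benchmark these suggestions against the OP's recommended 79XX settings, the CGMINER -v / -w switches from the first post make the experiment easy.  Note this is just a hypothetical sweep: the Phoenix kernel exposes VECTORS8, but I'm not certain every CGMINER build accepts a vector width of 8, so check your version first.

```shell
# OP's recommended baseline for 79XX:
cgminer -k diakgcn -v 2 -w 256
# combinations suggested above, worth benchmarking against the baseline
# (assuming your build accepts -v 8):
cgminer -k diakgcn -v 8 -w 64
cgminer -k diakgcn -v 8 -w 32
```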

gat3way
Sr. Member
****
Offline Offline

Activity: 256
Merit: 250


View Profile
February 06, 2012, 10:38:18 AM
Last edit: February 06, 2012, 10:48:27 AM by gat3way
 #43

There is no "native 16-component vector support" in any AMD GPU hardware, including GCN. OpenCL vectors are just a software abstraction that does not map directly onto hardware. Furthermore, the hardware is not SIMD (GCN's vector ALU units are more like SIMD, but even they are _not_ 16-wide). It would be rather naive and easy if vector operations were directly mapped to hardware capabilities, but that's not the case. You could, for example, imagine the VLIW4 or VLIW5 architecture operating as a 4-wide or 5-wide SIMD unit, and that sounds pretty logical, but that is not what happens in reality.

To emulate 16-component vectors, VLIW bundles are generated so that 16 ALU operations are performed rather than, say, 4. This means that where one or two VLIW bundles would be generated for a 4-wide vector ALU operation, 4 or more bundles are generated for a 16-wide one. The only benefit of doing this is tighter ALUPacking, which is not very relevant on 6xxx. In most cases, the difference in ALUPacking between 4-component vectors and wider ones is negligible if your code is written so that needless dependencies are eliminated.

Unfortunately, though, wider vectors mean more GPRs used, and more GPRs used means fewer wavefronts per CU. So in most cases, wider vectors mean slower kernels due to lower occupancy. There is a nice table in the AMD APP SDK programming guide on the correlation between GPRs used and wavefronts/CU.


There are some cases where uint16 might in fact improve performance, like simple kernels that execute fast where time is wasted on kernel scheduling and data transfers; in that case uint16 means more work per kernel invocation, and the overall effect is positive when you weigh it against the increased GPR usage. Bitcoin kernels, though, are not such a case.
d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 06, 2012, 11:09:05 AM
Last edit: February 06, 2012, 11:28:50 AM by d3m0n1q_733rz
 #44

There is no "native 16-component vector support" in any AMD GPU hardware, including GCN. OpenCL vectors are just a software abstraction that does not map directly onto hardware. Furthermore, the hardware is not SIMD (GCN's vector ALU units are more like SIMD, but even they are _not_ 16-wide). It would be rather naive and easy if vector operations were directly mapped to hardware capabilities, but that's not the case. You could, for example, imagine the VLIW4 or VLIW5 architecture operating as a 4-wide or 5-wide SIMD unit, and that sounds pretty logical, but that is not what happens in reality.

To emulate 16-component vectors, VLIW bundles are generated so that 16 ALU operations are performed rather than, say, 4. This means that where one or two VLIW bundles would be generated for a 4-wide vector ALU operation, 4 or more bundles are generated for a 16-wide one. The only benefit of doing this is tighter ALUPacking, which is not very relevant on 6xxx. In most cases, the difference in ALUPacking between 4-component vectors and wider ones is negligible if your code is written so that needless dependencies are eliminated.

Unfortunately, though, wider vectors mean more GPRs used, and more GPRs used means fewer wavefronts per CU. So in most cases, wider vectors mean slower kernels due to lower occupancy. There is a nice table in the AMD APP SDK programming guide on the correlation between GPRs used and wavefronts/CU.


There are some cases where uint16 might in fact improve performance, like simple kernels that execute fast where time is wasted on kernel scheduling and data transfers; in that case uint16 means more work per kernel invocation, and the overall effect is positive when you weigh it against the increased GPR usage. Bitcoin kernels, though, are not such a case.
Alright, but when it comes to 8 vectors, you can't argue with results.  I've posted the table of gains with the VLIW5 hardware I use.  And please read the papers on GCN again (assuming you read them once), as it's clearly stated that, "Not to be confused with the SIMD on Cayman (which is a collection of SPs), the SIMD on GCN is a true 16-wide vector SIMD. A single instruction and up to 16 data elements are fed to a vector SIMD to be processed over a single clock cycle. As with Cayman, AMD’s wavefronts are 64 instructions meaning it takes 4 cycles to actually complete a single instruction for an entire wavefront.  This vector unit is combined with a 64KB register file and that composes a single SIMD in GCN."
Now, as I was saying, since the SIMDs are 16-wide and there are four of them, each SIMD could be loaded with 16 vector components, which would allow the calculations to run on all of them without wasting any clock cycles.  Four 16-wide vectors at once sounds pretty good to me.
Cayman takes 4 clock cycles because SPs are used.  GCN handles them in one.  You do the math.
Now, I don't know why Dia's been getting lower hash results with 8 vectors when there are more ALUs to handle them.  But I have an HD5450 and I get the highest hashing rate using 8 vectors and a worksize of 64.  You can find my results on the previous page.  Oddly enough, that's on VLIW5, which isn't 8-wide.  8-wide would be using half the ALUs of a single SIMD on GCN.  So what gives?

BTW, I'm talking about the 79xx series.

gat3way
Sr. Member
****
Offline Offline

Activity: 256
Merit: 250


View Profile
February 06, 2012, 09:48:18 PM
 #45

I'm telling you again, you've gotten that wrong. The vector ALU unit on GCNs is not meant to map 1:1 with opencl's vectors. The GCN architecture is scalar in nature. The purpose of vector ALU units is to handle ALU operations that are handled per-workitem rather than those that are handled on a per-workgroup basis. The vector ALU operations take 4 cycles to execute as compared to the 1 cycle on the scalar unit. There might be an advantage to vectorization in some cases but that's not because the vector unit behaves as a 16-wide SIMD unit (which is wrong btw). The vector unit "appears" to operate as a SIMD one, but that comes at the price of the instruction latency.

There is now a section on GCN architecture on the official APP SDK documentation:

http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

Along with everything else, it is clearly stated there:

Quote
Notes –
•   Vectorization is no longer needed, nor desirable. If your code used to
    combine work-items in order to get better VLIW use, this is no longer
    required.


Anyway, this can be easily demonstrated. Here is a very simple OpenCL kernel that shifts a kernel argument and writes it into an output buffer. This is with uint16 vectors:


Code:
__kernel void test(uint16 in, __global uint16 *dest)
{
    dest[get_global_id(0)] = in >> 2;
}

Here is the ISA dump:

Code:
  s_buffer_load_dword  s0, s[8:11], 0x04                    // 00000000: C2000904
  s_buffer_load_dword  s1, s[8:11], 0x18                    // 00000004: C2008918
  s_buffer_load_dwordx4  s[8:11], s[12:15], 0x00            // 00000008: C2840D00
  s_buffer_load_dwordx4  s[20:23], s[12:15], 0x04           // 0000000C: C28A0D04
  s_buffer_load_dwordx4  s[24:27], s[12:15], 0x08           // 00000010: C28C0D08
  s_buffer_load_dwordx4  s[28:31], s[12:15], 0x0c           // 00000014: C28E0D0C
  s_buffer_load_dword  s2, s[12:15], 0x10                   // 00000018: C2010D10
  s_waitcnt     lgkmcnt(0)                                  // 0000001C: BF8C007F
  s_mul_i32     s0, s16, s0                                 // 00000020: 93000010
  s_add_i32     s0, s0, s1                                  // 00000024: 81000100
  v_add_i32     v0, vcc, s0, v0                             // 00000028: 4A000000
  s_lshr_b32    s0, s8, 2                                   // 0000002C: 90008208
  s_lshr_b32    s1, s9, 2                                   // 00000030: 90018209
  s_lshr_b32    s3, s10, 2                                  // 00000034: 9003820A
  s_lshr_b32    s8, s11, 2                                  // 00000038: 9008820B
  v_lshlrev_b32  v0, 6, v0                                  // 0000003C: 34000086
  v_add_i32     v0, vcc, s2, v0                             // 00000040: 4A000002
  v_mov_b32     v1, s0                                      // 00000044: 7E020200
  v_mov_b32     v2, s1                                      // 00000048: 7E040201
  v_mov_b32     v3, s3                                      // 0000004C: 7E060203
  v_mov_b32     v4, s8                                      // 00000050: 7E080208
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 00000054: EBF71000 80010100
  s_lshr_b32    s0, s28, 2                                  // 0000005C: 9000821C
  s_lshr_b32    s1, s29, 2                                  // 00000060: 9001821D
  s_lshr_b32    s2, s30, 2                                  // 00000064: 9002821E
  s_lshr_b32    s3, s31, 2                                  // 00000068: 9003821F
  s_waitcnt     expcnt(0)                                   // 0000006C: BF8C1F0F
  v_mov_b32     v1, s0                                      // 00000070: 7E020200
  v_mov_b32     v2, s1                                      // 00000074: 7E040201
  v_mov_b32     v3, s2                                      // 00000078: 7E060202
  v_mov_b32     v4, s3                                      // 0000007C: 7E080203
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen offset:48 format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 00000080: EBF71030 80010100
  s_lshr_b32    s0, s24, 2                                  // 00000088: 90008218
  s_lshr_b32    s1, s25, 2                                  // 0000008C: 90018219
  s_lshr_b32    s2, s26, 2                                  // 00000090: 9002821A
  s_lshr_b32    s3, s27, 2                                  // 00000094: 9003821B
  s_waitcnt     expcnt(0)                                   // 00000098: BF8C1F0F
  v_mov_b32     v1, s0                                      // 0000009C: 7E020200
  v_mov_b32     v2, s1                                      // 000000A0: 7E040201
  v_mov_b32     v3, s2                                      // 000000A4: 7E060202
  v_mov_b32     v4, s3                                      // 000000A8: 7E080203
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen offset:32 format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 000000AC: EBF71020 80010100
  s_lshr_b32    s0, s20, 2                                  // 000000B4: 90008214
  s_lshr_b32    s1, s21, 2                                  // 000000B8: 90018215
  s_lshr_b32    s2, s22, 2                                  // 000000BC: 90028216
  s_lshr_b32    s3, s23, 2                                  // 000000C0: 90038217
  s_waitcnt     expcnt(0)                                   // 000000C4: BF8C1F0F
  v_mov_b32     v1, s0                                      // 000000C8: 7E020200
  v_mov_b32     v2, s1                                      // 000000CC: 7E040201
  v_mov_b32     v3, s2                                      // 000000D0: 7E060202
  v_mov_b32     v4, s3                                      // 000000D4: 7E080203
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen offset:16 format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 000000D8: EBF71010 80010100
  s_endpgm                                                  // 000000E0: BF810000
end

; ----------------- CS Data ------------------------

codeLenInByte        = 228;Bytes

userElementCount     = 3;
;  userElements[0]    = IMM_UAV, 10, s[4:7]
;  userElements[1]    = IMM_CONST_BUFFER, 0, s[8:11]
;  userElements[2]    = IMM_CONST_BUFFER, 1, s[12:15]
extUserElementCount  = 0;
NumVgprs             = 5;
NumSgprs             = 32;



Now there is the uint4 version:


Code:
__kernel void test(uint4 in, __global uint4 *dest)
{
    dest[get_global_id(0)] = in >> 2;
}


Code:
  s_buffer_load_dword  s0, s[8:11], 0x04                    // 00000000: C2000904
  s_buffer_load_dword  s1, s[8:11], 0x18                    // 00000004: C2008918
  s_buffer_load_dwordx4  s[8:11], s[12:15], 0x00            // 00000008: C2840D00
  s_buffer_load_dword  s2, s[12:15], 0x04                   // 0000000C: C2010D04
  s_waitcnt     lgkmcnt(0)                                  // 00000010: BF8C007F
  s_mul_i32     s0, s16, s0                                 // 00000014: 93000010
  s_add_i32     s0, s0, s1                                  // 00000018: 81000100
  v_add_i32     v0, vcc, s0, v0                             // 0000001C: 4A000000
  s_lshr_b32    s0, s8, 2                                   // 00000020: 90008208
  s_lshr_b32    s1, s9, 2                                   // 00000024: 90018209
  s_lshr_b32    s3, s10, 2                                  // 00000028: 9003820A
  s_lshr_b32    s8, s11, 2                                  // 0000002C: 9008820B
  v_lshlrev_b32  v0, 4, v0                                  // 00000030: 34000084
  v_add_i32     v0, vcc, s2, v0                             // 00000034: 4A000002
  v_mov_b32     v1, s0                                      // 00000038: 7E020200
  v_mov_b32     v2, s1                                      // 0000003C: 7E040201
  v_mov_b32     v3, s3                                      // 00000040: 7E060203
  v_mov_b32     v4, s8                                      // 00000044: 7E080208
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 00000048: EBF71000 80010100
  s_endpgm                                                  // 00000050: BF810000
end

; ----------------- CS Data ------------------------

codeLenInByte        = 84;Bytes

userElementCount     = 3;
;  userElements[0]    = IMM_UAV, 10, s[4:7]
;  userElements[1]    = IMM_CONST_BUFFER, 0, s[8:11]
;  userElements[2]    = IMM_CONST_BUFFER, 1, s[12:15]
extUserElementCount  = 0;
NumVgprs             = 5;
NumSgprs             = 18;



As you can see, the IL->ISA backend did not even bother to map the vector operations onto the vector unit; it used the scalar unit exclusively. The first version does 16 scalar shifts and uses 32 SGPRs, the second does 4 scalar shifts and uses 18 SGPRs.

Now before you ask "why is it behaving like that", there are several reasons for this. Once again, stop thinking of OpenCL as something that should map 1:1 to hardware. OpenCL is a high-level API. Even with IL, you don't have that control. You cannot directly influence how the backend is going to map onto the hardware.

As for your improved results, I would advise you to have a look at the python host code and/or the share rate as reported by the pool you are using. It is likely that the progress indicator is not reporting the correct speed for some reason (wrong NDRange calculation, wrong divisor or something like that). I've done those experiments in the past with mine and others' bitcoin kernels and in all cases, the kernel performance dropped abruptly with vectorization above 4 (due to reduced occupancy).
d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 06, 2012, 11:10:38 PM
 #46

I'm telling you again, you've gotten that wrong. The vector ALU unit on GCNs is not meant to map 1:1 with opencl's vectors. The GCN architecture is scalar in nature. The purpose of vector ALU units is to handle ALU operations that are handled per-workitem rather than those that are handled on a per-workgroup basis. The vector ALU operations take 4 cycles to execute as compared to the 1 cycle on the scalar unit. There might be an advantage to vectorization in some cases but that's not because the vector unit behaves as a 16-wide SIMD unit (which is wrong btw). The vector unit "appears" to operate as a SIMD one, but that comes at the price of the instruction latency.

[rest of the quoted post, including both ISA dumps, snipped; it is quoted in full in #45 above]
Which GPU are you using specifically?  It sounds like you're describing the Cayman.  And my results are accurate.  It's more likely that the results dropped due to register spilling or high worksize settings.

gat3way
Sr. Member
****
Offline Offline

Activity: 256
Merit: 250


View Profile
February 06, 2012, 11:19:03 PM
 #47

You don't recognize Cayman ISA from Tahiti ISA? Smiley

Well, actually this was cross-compiled using the cl_amd_offline_devices extension, an AMD extension that lets you compile binary kernels for all hardware supported by the driver. The system I got the dumps and built the kernels on is a 6870 one. It doesn't matter though, as the generated binary is the same as the one you would get from clBuildProgram() on 79xx.
d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 06, 2012, 11:45:46 PM
 #48

You don't recognize Cayman ISA from Tahiti ISA? Smiley

Well, actually this is cross-compiled using the cl_amd_offline_devices extension. It is an AMD extension that lets you compile binary kernels for all hardware supported by the driver. The system I got the dumps and built kernels on is a 6870 one. It does not matter though as the generated binary is the same as the one you would get from clBuildProgram() on 79xx.
Actually, I was talking about Tahiti vs. Cayman, as Tahiti uses GCN and Cayman uses VLIW.  The Tahiti GPU is different from the others in the 7xxx series because it's based on the GCN architecture, which contains four full 16-wide vector units.  From what you've told me, the best settings for the 79xx series cards would be 2 vectors and extremely high worksizes, but that data is based on the lower-end VLIW-based 7xxx cards in the series.  Since it takes 8 cycles to complete a group on VLIW, vectorization seems like a good option there.
The document you posted talks about the Southern Islands cards in general, but not the specifics of the 16-wide vectors.
My argument is that VLIW can use more instructions, but GCN can use more vectors.  The idea I'm trying to convey is to keep the vectors wide and the instructions applied to them few.  But I can't seem to avoid the darn spillover in the registers.  >_<

gat3way
Sr. Member
****
Offline Offline

Activity: 256
Merit: 250


View Profile
February 07, 2012, 12:04:24 AM
 #49

Nope. There is no reason extremely high (what's extreme, btw? 256?) worksizes would work best. My bet would be that 64 works best. I might not be correct about this since I lack data, and I lack data because I am too lazy to profile current bitcoin kernels Smiley

The dumps I posted are from the Tahiti architecture; you can see there are no VLIW bundles and no clauses there, and the GCN ISA is clearly different from the VLIW one Smiley Also, there is no reason why 2-component vectors would work best on Tahiti. Why do you think uint2 would work best? I don't think so. It might work well. It might not. Have you run it through SKA or sprofile?

Once again (I am kinda tired of this, so I am not going to restate it anymore): there is no 1:1 mapping between OpenCL vectors and the 79xx's vector ALU units. The 79xx cannot "use more vectors", and "using more vectors" does not mean "using fewer instructions" on GCN hardware. Though frankly, I don't see why I am arguing about this. You are free to profile and benchmark. Again, do profile Smiley

d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 07, 2012, 12:36:44 AM
 #50

Nope. There is no reason extremely high (what's extreme btw? 256?) worksizes would work best. My bet would be 64 would work best. I might not be correct about this since I lack data and I lack data cause I am too lazy to profile current bitcoin kernels Smiley

The dumps I posted you are from Tahiti architecture, you can see no VLIW bundles and no clauses there and the GCN ISA is clearly different from the VLIW one Smiley Also, there is no reason why 2-component vectors would work best on Tahiti. Why do you think uint2 would work best? I don't think so. It might work well. It might not. You have run that through SKA or sprofile?

Once again (I am kinda tired of this so I am not going to reinstate that anymore), there is no 1:1 mapping between OpenCL vectors and the 79xx's vector ALU units. 79xx cannot "use more vectors" as "using more vectors" does not mean "use less instructions" on GCN hardware. Though frankly said I don't see a reason why am I arguing about that. Actually you are free to profile and benchmark. Again, do profile Smiley


>_<  That's what I'm trying to do.  I'm trying to allow it to use all 16 vectors at once instead of using smaller vectors to achieve the same thing.  In this way, the instructions aren't repeated and the overhead is removed.  But I've seen best results while using 8 vectors and a worksize of 64 or 32.

Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 07, 2012, 03:49:26 PM
 #51

If you guys did not see it, there is a new Phoenix 2 beta for which I released a DiaKGCN preview, see here:
https://bitcointalk.org/index.php?topic=62765.msg734465#msg734465

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
-ck
Legendary
*
Offline Offline

Activity: 4102
Merit: 1632


Ruu \o/


View Profile WWW
February 07, 2012, 10:51:36 PM
 #52

So how are we going diapolo? Is your kernel ready for me to port it to cgminer Wink

Developer/maintainer for cgminer, ckpool/ckproxy, and the -ck kernel
2% Fee Solo mining at solo.ckpool.org
-ck
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 08, 2012, 06:53:01 AM
 #53

So how are we going diapolo? Is your kernel ready for me to port it to cgminer Wink

Hey Con,

It's ready for getting assimilated Cheesy, the only thing is I really need your help for this. There are some differences in the supplied kernel variables and compiler arguments, which we should take a look at. Another difference is the output buffer, which is currently not compatible with the CGMINER code (but could be changed rather easily). I added another method of nonce calculation via the OpenCL 1.1 global offset, so a flag or function to detect OpenCL 1.1 would be needed in the CGMINER API.

At the end of the week I should have a bit more time than I have now, but the planning phase can start as soon as you give me a go (and take me by the hand ^^). What would you suggest as a first step?

Dia

-ck
Legendary
*
Offline Offline

Activity: 4102
Merit: 1632


Ruu \o/


View Profile WWW
February 08, 2012, 07:14:44 AM
 #54

Check my git tree. There's a diakgcn branch now. Only the output side is different, but I can't get shares out of it yet. No support for new macros yet either.

-ck
Legendary
*
Offline Offline

Activity: 4102
Merit: 1632


Ruu \o/


View Profile WWW
February 08, 2012, 07:46:56 AM
 #55

Check my git tree. There's a diakgcn branch now. Only the output side is different, but I can't get shares out of it yet. No support for new macros yet either.
Investigate the imported diakgcn. The only significant change is to the output code, but I get no shares yet...
https://github.com/ckolivas/cgminer/blob/diakgcn/diakgcn120208.cl

Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 08, 2012, 10:15:06 AM
 #56

Check my git tree. There's a diakgcn branch now. Only the output side is different, but I can't get shares out of it yet. No support for new macros yet either.
Investigate the imported diakgcn. The only significant change is to the output code, but I get no  shares yet...
https://github.com/ckolivas/cgminer/blob/diakgcn/diakgcn120208.cl

Cool, I'll take a look at it ...

Dia

d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 08, 2012, 02:02:05 PM
 #57

Hey, look into removing the base and uu int by using #if GOFFSET=false as uu and base are not used when GOFFSET is.  They're just two more instructions and register entries.

Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 08, 2012, 02:06:02 PM
 #58

Hey, look into removing the base and uu int by using #if GOFFSET=false as uu and base are not used when GOFFSET is.  They're just two more instructions and register entries.

I had that in mind, because if GOFFSET is active, base is simply unused and can be removed from the code Smiley. Thanks for sharing your idea!

Dia

d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 08, 2012, 02:56:31 PM
 #59

Hey, look into removing the base and uu int by using #if GOFFSET=false as uu and base are not used when GOFFSET is.  They're just two more instructions and register entries.

I had that in mind, because if GOFFSET is active, base is simply unused and can be removed from the code Smiley. Thanks for sharing your idea!

Dia
Just remember to use #ifndef GOFFSET with the constant that sets base to uu.
I've also added GOFFSET to Phatk2!  *Happy dance*

It's tested and works.

Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 08, 2012, 03:17:28 PM
 #60

Hey, look into removing the base and uu int by using #if GOFFSET=false as uu and base are not used when GOFFSET is.  They're just two more instructions and register entries.

I had that in mind, because if GOFFSET is active, base is simply unused and can be removed from the code Smiley. Thanks for sharing your idea!

Dia
Just remember to use #ifndef GOFFSET with the constant that sets base to uu.
I've also added GOFFSET to Phatk2!  *Happy dance*

It's tested and works.

Nice work Smiley, got your version faster on your machine?

Dia

d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 08, 2012, 03:31:03 PM
 #61

Hey, look into removing the base and uu int by using #if GOFFSET=false as uu and base are not used when GOFFSET is.  They're just two more instructions and register entries.

I had that in mind, because if GOFFSET is active, base is simply unused and can be removed from the code Smiley. Thanks for sharing your idea!

Dia
Just remember to use #ifndef GOFFSET with the constant that sets base to uu.
I've also added GOFFSET to Phatk2!  *Happy dance*

It's tested and works.

Nice work Smiley, got your version faster on your machine?

Dia
I'm still playing around with it.  I think I'll take a nap and see how well it does.  I'm hoping that it'll work better, but I'm still moving things around.  I'm having trouble keeping Vectors8 from spilling over into memory though.
As for your GCN version, I haven't messed with it as much lately.  I'll get back to it soon.  I'm just wanting to even things out and see what works best with certain ideas.

Also, in your code here:
Code:
#else
#ifdef GOFFSET
u nonce = (uint)get_global_id(0);
#else
u nonce = ((uint)get_group_id(0) * (uint)WORKSIZExVECSIZE) + (uint)get_local_id(0) + base;
Should we add (u) to the GOFFSET nonce?
Code:
#else
#ifdef GOFFSET
u nonce = (uint)get_global_id(0) + (u);
#else
u nonce = ((uint)get_group_id(0) * (uint)WORKSIZExVECSIZE) + (uint)get_local_id(0) + base;
  Like this

Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 08, 2012, 03:39:32 PM
 #62

No, (u) is not a variable; it's only used as a type-cast in front of variables, where u is replaced by uint, uint2, uint4 or uint8 (depending on the vector width in use).

Example:

#define u uint
ulong Test_ulong = 17;
uint Test_uint = (u)Test_ulong; // the preprocessor turns this into: uint Test_uint = (uint)Test_ulong;

Dia

d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 08, 2012, 03:44:59 PM
 #63

Right, so not adding the uint cast doesn't cause an offset?
I ran into something with the Phatk2 kernel where (uint) casts were added to PreVal0 and PreVal4 and were required; otherwise the offset caused problems (as far as I can tell, anyway).  So I just wanted to be sure.

Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 08, 2012, 03:50:29 PM
 #64

Right, so not adding uint doesn't cause an offset?
I ran into something with the Phatk2 kernel which added uint as PreVal0 and PreVal4 that were required otherwise the offset caused problems (as far as I can tell anyway).  So I just wanted to be sure.

You simply can't add (u) or uint on its own; a cast has no value and should generate a syntax error in the compiler.
A variable can be of type uint (= unsigned integer: 4 bytes of storage for unsigned integer values).
Perhaps you could create your own thread where we can discuss your version, as our discussion here is bloating this thread Cheesy.

Dia

d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 08, 2012, 05:10:53 PM
 #65

Right, so not adding uint doesn't cause an offset?
I ran into something with the Phatk2 kernel which added uint as PreVal0 and PreVal4 that were required otherwise the offset caused problems (as far as I can tell anyway).  So I just wanted to be sure.

You simply can't add (u) or uint, it has no value and should generate a syntax error in the compiler.
A variable can be of type uint (= unsigned integer -> 4 Byte storage for unsigned integer values).
Perhaps you could create an own thread where we can discuss your version as our discussion here bloats this thread Cheesy.

Dia
*Shrugs*  It works.  But okay, I'll toss together a thread later on.  Right now, I'm groggy and am going to take a nap.  And I used (u)(0)... which equals 0 so... zzz... I'm an idiot... zzz...

d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 12, 2012, 10:43:19 AM
 #66

So, Dia, how goes the kernel programming?  I ask because I'm curious; not because I wanted to bump your thread or anything like that.   Grin
Have you gotten the CGminer porting problem solved yet?

Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 12, 2012, 01:40:55 PM
 #67

So, Dia, how goes the kernel programming?  I ask because I'm curious; not because I wanted to bump your thread or anything like that.   Grin
Have you gotten the CGminer porting problem solved yet?

Interest is low, most people don't seem to care about it ... but I'm still working on DiaKGCN Smiley.

Dia

d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 12, 2012, 02:10:33 PM
 #68

So, Dia, how goes the kernel programming?  I ask because I'm curious; not because I wanted to bump your thread or anything like that.   Grin
Have you gotten the CGminer porting problem solved yet?

Interest is low, most people don't seem to care about it ... but I'm still working on DiaKGCN Smiley.

Dia
The problem with appeal might be the $560 card required to make full use of it.  But you've included some really fine ideas and work in this kernel.  Honestly, I wish I could help more.  All I can do is suggest different methods of achieving the same thing.  Like when you created another variable to remove const u base when used with GOFFSET, I suggested, "Why not just throw an #ifndef GOFFSET statement into the void search?"
Your methods are really genius.  Heck, I have a difficult time following your code just because I have to look up some of the methods you use.  So keep it up!  I'm not getting as much help with the Phatk2 mod as I had hoped, and I've only made it as far as moving some math around and changing the nonce method, but I want to see what these cards are really capable of if we push them to their limits and filter out all the crap.
BTW, I sort of borrowed your direct-nonce-to-miner idea.  I'm actually a little surprised that I didn't think of doing the same thing.  >_<  Who really needs nonce anyway?

Something to ponder and hopefully never attempt:  What would happen if we could write the entire SHA-256 algorithm in a single and really long line of code?  @_@  Would it even be possible?! 

e21
Member
**
Offline Offline

Activity: 105
Merit: 10


View Profile
February 13, 2012, 09:24:12 PM
 #69

So, Dia, how goes the kernel programming?  I ask because I'm curious; not because I wanted to bump your thread or anything like that.   Grin
Have you gotten the CGminer porting problem solved yet?

Interest is low, most people don't seem to care about it ... but I'm still working on DiaKGCN Smiley.

Dia

I look forward to testing this out in a couple days once the HD 7770 is out  Grin
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 14, 2012, 05:49:37 AM
 #70

So, Dia, how goes the kernel programming?  I ask because I'm curious; not because I wanted to bump your thread or anything like that.   Grin
Have you gotten the CGminer porting problem solved yet?

Interest is low, most people don't seem to care about it ... but I'm still working on DiaKGCN Smiley.

Dia

I look forward to testing this out in a couple days once the HD 7770 is out  Grin

7770 should not be GCN, but VLIW4 afaik Wink.

Dia

d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 14, 2012, 08:56:20 AM
 #71

So, Dia, how goes the kernel programming?  I ask because I'm curious; not because I wanted to bump your thread or anything like that.   Grin
Have you gotten the CGminer porting problem solved yet?

Interest is low, most people don't seem to care about it ... but I'm still working on DiaKGCN Smiley.

Dia

I look forward to testing this out in a couple days once the HD 7770 is out  Grin

7770 should not be GCN, but VLIW4 afaik Wink.

Dia
I second that.

Roadhog2k5
Full Member
***
Offline Offline

Activity: 131
Merit: 100



View Profile
February 14, 2012, 05:02:21 PM
 #72

7770 is GCN.
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 14, 2012, 05:24:19 PM
 #73

7770 is GCN.

If 7770 is Cape Verde, I made a mistake with my assumption ... thought 79XX = highend (GCN), 78XX = midrange (GCN) and 77XX = lowend with VLIW4.

Dia

e21
Member
**
Offline Offline

Activity: 105
Merit: 10


View Profile
February 14, 2012, 06:07:11 PM
 #74

7770 is GCN.

If 7770 is Cape Verde, I made a mistake with my assumption ... thought 79XX = highend (GCN), 78XX =midrange (GCN) and 77XX = lowend with VLIW4.

Dia

No worries  Grin Nope; HD 7750 and up are GCN, everything 76xx and under is 40nm VLIW5 I believe (so as to be compatible with CrossFire using Llano A series APUs)
d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 14, 2012, 07:26:59 PM
 #75

7770 is GCN.

If 7770 is Cape Verde, I made a mistake with my assumption ... thought 79XX = highend (GCN), 78XX =midrange (GCN) and 77XX = lowend with VLIW4.

Dia

No worries  Grin Nope; HD 7750 and up are GCN, everything 76xx and under is 40nm VLIW5 I believe (so as to be compatible with CrossFire using Llano A series APUs)
http://lenzfire.com/2011/12/entire-gcn-lineup-hd-7000-series-specs-and-price-revealed-60538/
Verified!
Though, you might be better off getting the 7790 for the price.

e21
Member
**
Offline Offline

Activity: 105
Merit: 10


View Profile
February 17, 2012, 02:12:33 AM
 #76

Hi Dia!

Tried your kernel with the new 7770... BFI_INT crashes phoenix, so I had to tweak your kernel so it wasn't enabled by default. Getting 148.5 MH/s at 1000 MHz using VECTORS; this seems to be about the same as I get with phatk, unfortunately  Undecided

I tried to get your kernel to work with Phoenix 2.0 rc1 but when I specify kernel = diakgcn in the config file and try to run it, I get 'failed to load kernel diakgcn'. Any chance of updating your kernel to work with the new phoenix? I get 161MHs using phatk2 and phoenix 2.0rc1 when autoconfigure = enabled, which is the highest I have gotten from this card at stock speed.

Thanks!
d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 17, 2012, 05:10:48 AM
 #77

Hi Dia!

Tried your kernel with the new 7770.. BFI_INT crashes phoenix, so I had to tweak your kernel so it wasn't enabled by default. Getting 148.5MHs at 1000MHz, using VECTORS, this seems to be about the same as I get with phatk unfortunately  Undecided

I tried to get your kernel to work with Phoenix 2.0 rc1 but when I specify kernel = diakgcn in the config file and try to run it, I get 'failed to load kernel diakgcn'. Any chance of updating your kernel to work with the new phoenix? I get 161MHs using phatk2 and phoenix 2.0rc1 when autoconfigure = enabled, which is the highest I have gotten from this card at stock speed.

Thanks!
If you want to disable BFI_INT, the easiest and quickest method is BFI_INT=false.  As for diakgcn, you might play around with different vector types, GOFFSET=false and different worksizes.  The same can be said for phatk2.  But don't knock it; it's some really wonderful work.

Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 17, 2012, 07:23:58 AM
 #78

Hi Dia!

Tried your kernel with the new 7770.. BFI_INT crashes phoenix, so I had to tweak your kernel so it wasn't enabled by default. Getting 148.5MHs at 1000MHz, using VECTORS, this seems to be about the same as I get with phatk unfortunately  Undecided

I tried to get your kernel to work with Phoenix 2.0 rc1 but when I specify kernel = diakgcn in the config file and try to run it, I get 'failed to load kernel diakgcn'. Any chance of updating your kernel to work with the new phoenix? I get 161MHs using phatk2 and phoenix 2.0rc1 when autoconfigure = enabled, which is the highest I have gotten from this card at stock speed.

Thanks!

Can you tell me what the device name is for your 7770? It should be displayed if you use verbose mode with Phoenix (-v as a switch in Phoenix 1 and verbose = true in the config for Phoenix 2). I don't supply any updates to the Phoenix 1.x version; for Phoenix 2 there was a preview release. If you are interested, I can put together a package with the latest kernel and init for Phoenix 2 and send it to you for testing. That would be no official public release, as I have not documented the latest changes to build a proper changelog.

Oh, and did you use VECTORS2? VECTORS is not a supported switch for DiaKGCN!

Dia

e21
Member
**
Offline Offline

Activity: 105
Merit: 10


View Profile
February 17, 2012, 07:30:05 PM
 #79

Hi Dia!

Tried your kernel with the new 7770.. BFI_INT crashes phoenix, so I had to tweak your kernel so it wasn't enabled by default. Getting 148.5MHs at 1000MHz, using VECTORS, this seems to be about the same as I get with phatk unfortunately  Undecided

I tried to get your kernel to work with Phoenix 2.0 rc1 but when I specify kernel = diakgcn in the config file and try to run it, I get 'failed to load kernel diakgcn'. Any chance of updating your kernel to work with the new phoenix? I get 161MHs using phatk2 and phoenix 2.0rc1 when autoconfigure = enabled, which is the highest I have gotten from this card at stock speed.

Thanks!

Can you tell me, what the device name is for your 7770? It should be displayed, if you use verbose mode with Phoenix (-v as switch in Phoenix 1 and verbose = true for Phoenix 2 in it's config). I don't supply any updates to the Phoenix 1.x version for Phoenix 2 there was a preview release. If you are interested I can put together a package with the last kernel and init for Phoenix 2 and send it to you for testing? That would be no public official release, as I have not done any documentation for the last changes to build a proper changelog.

Oh and did you use VECTORS2, as VECTORS is no supported switch for DiaKGCN!

Dia

Thanks for getting back to me Dia,

I would love it if you could send me the latest build to use with Phoenix 2!

Here is the output of Phoenix 1.7.5 using -v and VECTORS2 (VECTORS functions the same as VECTORS2, so I guess it just defaults to VECTORS2) with your kernel (the 7770 is Capeverde; not sure if that's the name you were looking for):


C:\Bitcoin>phoenix.exe -u http://username.worker:password@api2.bitcoin.cz:
8332/ -k diakgcn -v AGGRESSION=9 WORKSIZE=128 VECTORS2 DEVICE=3 BFI_INT F3
[17/02/2012 11:22:30] using PyOpenCL version 0.92
[17/02/2012 11:22:30] checked nonces per kernel execution: 33554432
[17/02/2012 11:22:30] using VECTORS2, resulting global worksize is: 16777216
[17/02/2012 11:22:30] using local worksize of 128 (HW max. is 256)
[17/02/2012 11:22:30] BFI_INT patching not supported on Capeverde
[17/02/2012 11:22:30] OpenCL >= 1.1 supported, using global offset

[17/02/2012 11:22:31] Phoenix v1.7.5 starting...
[17/02/2012 11:22:32] Connected to server
[17/02/2012 11:22:32] Server gave new work; passing to WorkQueue
[17/02/2012 11:22:32] New block (WorkQueue)
[17/02/2012 11:22:32] Currently on block: 167217
[17/02/2012 11:22:32] Server gave new work; passing to WorkQueue
[17/02/2012 11:22:56] positive nonce (lo - 1): 2985823925
[17/02/2012 11:22:57] Result 0000000068f61663... accepted
[17/02/2012 11:23:05] Server gave new work; passing to WorkQueue
[17/02/2012 11:23:27] positive nonce (hi - 1): 3409286864
[17/02/2012 11:23:28] Result 00000000e79cdf84... accepted
[17/02/2012 11:23:33] positive nonce (lo - 1): 4227931077
[17/02/2012 11:23:34] Result 00000000d6000f90... accepted
[17/02/2012 11:23:34] Server gave new work; passing to WorkQueue
[17/02/2012 11:23:35] positive nonce (lo - 1): 171201403
[17/02/2012 11:23:36] Result 00000000b0162afe... accepted
[17/02/2012 11:24:03] Server gave new work; passing to WorkQueue
[17/02/2012 11:24:15] positive nonce (hi - 1): 1829993980
[17/02/2012 11:24:15] Result 000000006309b5fd... accepted
[17/02/2012 11:24:31] Server gave new work; passing to WorkQueue
[17/02/2012 11:25:00] Server gave new work; passing to WorkQueue
[17/02/2012 11:25:16] positive nonce (hi - 1): 1405120782
[17/02/2012 11:25:17] Result 000000005a4c086c... accepted
[150.81 Mhash/sec] [6 Accepted] [0 Rejected] [RPC (+LP)]

The 150 MH/s above is while clocked at 1240 MHz.

-ck
Legendary
*
Offline Offline

Activity: 4102
Merit: 1632


Ruu \o/


View Profile WWW
February 20, 2012, 05:25:22 AM
 #80

Diakgcn is now working in cgminer 2.2.7 (included). Specify it with -k diakgcn. No, you cannot insert a Phoenix-oriented diakgcn kernel into cgminer.

Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 20, 2012, 06:20:32 AM
 #81

Diakgcn now working in cgminer 2.2.7 (included). Specify it with -k diakgcn. No you cannot insert a phoenix orientated diakgcn kernel into cgminer.

Great stuff and a thank you, I will add that information to the front page of this thread!

Dia

blandead
Newbie
*
Offline Offline

Activity: 46
Merit: 0


View Profile
February 22, 2012, 06:44:57 AM
 #82

I got vectors3 working in case you're curious, speed is amazing : O
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
February 22, 2012, 07:17:53 AM
 #83

I got vectors3 working in case you're curious, speed is amazing : O

More details please Wink ... which GPU, which kernel? I never got that working with Phoenix.
And what does "amazing" mean Cheesy?

Dia

blandead
Newbie
*
Offline Offline

Activity: 46
Merit: 0


View Profile
February 26, 2012, 11:18:30 PM
 #84

I got vectors3 working in case you're curious, speed is amazing : O

More details please Wink ... used GPU, which kernel. I never got that working with Phoenix.
And what means "amazing" Cheesy?

Dia

Was using your latest kernel with Phoenix 2 rc1 on a Cayman 6970, Catalyst 12.3.

The new OpenCL version has built-in support for uint3, so it might even work with your older version. It's basically just doing a typedef of uint4 to uint3, and it just assigns a 0 to the last value of uint3 to ignore 'w'. The problem with your old kernel is that you need to initialize uint3 as a uint4 first.

Speed is good because you basically get to run vectors4 (as vectors3) with fewer ALUs. Even though 'w' is not used, finding x, y and z nonces occurs much more frequently anyway.
d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 27, 2012, 11:43:36 PM
 #85

I got vectors3 working in case you're curious, speed is amazing : O

More details please Wink ... used GPU, which kernel. I never got that working with Phoenix.
And what means "amazing" Cheesy?

Dia

Was using your latest kernel with phoenix 2  rc1 on a cayman 6970 cat 12.3

New opencl version has built in support for uint3 so it might even work with your older version. It's basically just doing a typedef uint4 uint3, and it just assigns a 0 for last value of uint3 to ignore 'w'. The problem with your old kernel is you need to initialize uint3 as a uint4 first

Speed is good because you basically get to run vectors4 (vectors3) with less alu's. Even tho 'w' is not used finding x, y, and z nonces occurs much more frequently anyways
Why can't you just init as a uint3 in the first place?  Did you not change the python init file to handle the uint3 as of yet?

blandead
Newbie
*
Offline Offline

Activity: 46
Merit: 0


View Profile
February 28, 2012, 02:49:13 AM
 #86

Why can't you just init as a uint3 in the first place?  Did you not change the python init file to handle the uint3 as of yet?

I did change the python init file, but just realized I accidentally set rateDivisor = 3 out of habit; it's supposed to be 4, since you're supposed to treat uint3 as a uint4. It should work like that, but I'll finish testing tomorrow.

You can't init as a uint3 in the first place because there is no definition for a true "uint3" in that sense; it's just a stripped-down uint4 (the last value is thrown out or used as a placeholder). I'm gonna work on it more tomorrow, but once I get it working, this should give a 10% increase over uint2. It only uses ~920 ALUs so far.
gat3way
Sr. Member
****
Offline Offline

Activity: 256
Merit: 250


View Profile
February 28, 2012, 02:54:31 AM
 #87

Great to see that.

Be careful with speed calculations though; you might be calculating the MH/s based on the presumption that you are doing 4 nonces per work-item, which is wrong.
d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
February 28, 2012, 04:44:30 AM
 #88

uint3 seems to be working fine for me without having to treat it as a uint4. As a matter of fact, it's even listed in the OpenCL reference card. But wouldn't it be best to accept the extra ALUs and do the work for the 4th vector if the space for it is already being taken up? It's a trade-off: mathematically, does the ALU cost of three vectors justify using three, or do 4 vectors give more output for the work done to achieve it?

blandead
Newbie
*
Offline Offline

Activity: 46
Merit: 0


View Profile
February 28, 2012, 07:17:48 PM
 #89

uint3 is in there, but this is how it's listed

/* cl_uint3 is identical in size, alignment and behavior to cl_uint4. See section 6.1.5. */
typedef  cl_uint4  cl_uint3;

Anyway, I'm still getting init errors that have nothing to do with rateDivisor; gonna try a different version of Phoenix.

You'd think it would be better to use vectors4, but the ALUs shoot up to over 3000
d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
March 01, 2012, 11:55:55 PM
 #90

uint3 is in there, but this is how it's listed

/* cl_uint3 is identical in size, alignment and behavior to cl_uint4. See section 6.1.5. */
typedef  cl_uint4  cl_uint3;

Anyway, I'm still getting init errors that have nothing to do with rateDivisor; gonna try a different version of Phoenix.

You'd think it would be better to use vectors4, but the ALUs shoot up to over 3000
That was quoted from cl_platform.h.
You can use uint3 directly now; no typedef is needed anymore. Used directly, it is a true 3-component uint.

blandead
Newbie
*
Offline Offline

Activity: 46
Merit: 0


View Profile
March 02, 2012, 05:33:38 PM
 #91

I was just putting that there for reference; not sure if it's in older versions. I got it working fine, but there is something messed up with the output buffer: it always says "lo - 3". Maybe I should stick with outputting as ulong? Why is it like that, exactly? I tried doing a global output of uint3, which still works, but result.z will sometimes say the difficulty is less than 1 (it's just barely outside the range, too). I have no problem with result.x or result.y, though.
d3m0n1q_733rz
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250



View Profile WWW
March 03, 2012, 04:49:35 AM
 #92

I was just putting that there for reference; not sure if it's in older versions. I got it working fine, but there is something messed up with the output buffer: it always says "lo - 3". Maybe I should stick with outputting as ulong? Why is it like that, exactly? I tried doing a global output of uint3, which still works, but result.z will sometimes say the difficulty is less than 1 (it's just barely outside the range, too). I have no problem with result.x or result.y, though.
There may be an offset in z.  Try replacing them with s0 through s2 even though they're the same.  It'll keep all of the code compatible with each other.

blandead
Newbie
*
Offline Offline

Activity: 46
Merit: 0


View Profile
March 22, 2012, 09:39:33 PM
Last edit: March 22, 2012, 10:26:12 PM by blandead
 #93

I prefer to use result.s0 and such anyway, but I don't use the offset feature, as it only lowers the ALU count by 1 while raising GPR usage, making the card run hotter. Anyway, I found time to fix it and get it working; all three outputs are good. I was mistaken about the output buffer, since there is no (hi - 3) with vectors3, as there is no result.w.

So now it correctly reports (lo - 1), (hi - 1), or (lo - 3). Does anyone want to try it out?
Logist
Newbie
*
Offline Offline

Activity: 9
Merit: 0


View Profile
May 25, 2012, 01:32:05 PM
 #94

Am I blind? Where do I download this kernel? The links on the first page lead to just the http://www.filedropper.com/ page and nothing happens :D I have a 7970 to test and I'm trying to go with Phoenix 2.0 and the diakgcn kernel. Maybe you can PM me the test version?
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
May 25, 2012, 01:50:03 PM
 #95

I only zipped the last development version and upped it here:
http://www.filedropper.com/diakgcnphoenix2

Sorry, there is no current readme, but the first posting should cover everything.

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
purelithium
Hero Member
*****
Offline Offline

Activity: 504
Merit: 500



View Profile
July 21, 2012, 07:31:45 PM
 #96

Ok, none of the download links for the kernel work. Can you re-up these to a different host? I think they deleted your files. I really want to try your kernel, but it's not working.

Like my post? 1H7bfRYh7F89mfmFgsRCdn4awDaUHQmYqY
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
July 21, 2012, 11:14:03 PM
 #97

Ok, none of the download links for the kernel work. Can you re-up these to a different host? I think they deleted your files. I really want to try your kernel, but it's not working.

Just use the current CGMINER and specify these switches:
Code:
-k diakgcn -v 2 -w 256

Dia

purelithium
Hero Member
*****
Offline Offline

Activity: 504
Merit: 500



View Profile
July 22, 2012, 12:08:33 AM
 #98

I use Phoenix 2.0.0, so this won't work.

With cgminer I get 280 MH/s (regardless of which kernel I specify); with Phoenix 2 and phatk2 I get 320 on my 6870. I want to see what your kernel will do on Phoenix.

Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
July 22, 2012, 09:30:16 AM
 #99

I use phoenix 2.0.0, so this won't work.

With cgminer, I get 280 mhash/s(regardless of what kernel I specify) with Phoenix 2 and phatk2, I get 320 with my 6870. I want to see what your kernel will do on phoenix.

Try this link: http://www.filedropper.com/diakgcnphoenix2

Dia

purelithium
Hero Member
*****
Offline Offline

Activity: 504
Merit: 500



View Profile
July 22, 2012, 01:32:50 PM
 #100

Thanks a lot for doing that!

I tried it with my 6870 and couldn't get it past 300 MH/s, so I'm gonna stick with phatk2 and my 320 MH/s.

ajareselde
Legendary
*
Offline Offline

Activity: 1722
Merit: 1000

Satoshi is rolling in his grave. #bitcoin


View Profile
August 01, 2012, 04:29:11 AM
 #101

Can you repost the kernel for Phoenix 1? Your links are broken.
Thanks
zvs
Legendary
*
Offline Offline

Activity: 1680
Merit: 1000


https://web.archive.org/web/*/nogleg.com


View Profile WWW
August 25, 2012, 11:01:52 AM
 #102

Can you repost the kernel for Phoenix 1? Your links are broken.
Thanks
I have some of the older ones (that are/were on MediaFire), but none of those others.

http://www.nogleg.com/archive/
Diapolo (OP)
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
September 20, 2012, 05:17:29 PM
 #103

Upped the last Phoenix 2 version again, as some requested it. No more code updates, though.

Dia
