Author Topic: [ANN][GRS][DMD][DGB] Pallas optimized groestl opencl kernels  (Read 56148 times)
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 12, 2014, 08:00:34 AM
 #1

**** MYRIAD GROESTL ****

If you are looking for the closed source myriad groestl miner (for DGB, SFR, etc.) look here instead:

https://satoshibox.com/fttcfvpiyhbod7ueidmgdhym

ABOUT

This is my optimized OpenCL kernel for Groestlcoin, Diamond and similar coins (groestl + groestl algorithm; not myriad-groestl, which is groestl + sha: see the top of this post for the latter).
It is based on the sph version originally shipped with sph-sgminer, but it has since been completely rewritten.
It should be compatible with all sph-sgminer versions and derivatives.

PERFORMANCE

v1 - to be compiled with catalyst 14.6 or 14.7:

R9 290x @1125 Mhz: ~26.4 Mh/s
R9 290 @1200: ~25 Mh/s
R9 280x (stock): ~18 Mh/s
7950 @1200: ~16 Mh/s
R9 270X: ~9.7 Mh/s

v2 - experimental hawaii only bin:

R9 290x @1125 Mhz: ~34.4 Mh/s
R9 290 @1100: ~30.6 Mh/s

Wolf0's Tahiti binary:

R9 280x: ~25 Mh/s

HOW TO USE

- Stop the miner
- Replace groestlcoin.cl, diamond.cl and/or the kernel you want to use with this one (it's inside the "kernel" folder)
- Remove all the .bin files (in the main folder)
- Set worksize to 256 only (-w 256)
- Run and enjoy!
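
The file-handling steps above can be sketched as a small script. This is only an illustration under assumptions: the function name `install_kernel` and its parameters are hypothetical, the miner is assumed to be stopped already, and the folder layout (a "kernel" folder plus .bin files in the main folder) is taken from the steps above.

```python
import shutil
from pathlib import Path


def install_kernel(miner_dir, new_kernel, target_name="groestlcoin.cl"):
    """Copy new_kernel over kernel/<target_name> and delete cached .bin files.

    Hypothetical helper: miner_dir is the miner's main folder.
    Returns the names of the .bin files that were removed.
    """
    miner_dir = Path(miner_dir)
    kernel_dir = miner_dir / "kernel"
    kernel_dir.mkdir(exist_ok=True)
    shutil.copyfile(new_kernel, kernel_dir / target_name)

    removed = []
    # Compiled kernels are cached as .bin files in the main folder;
    # deleting them forces a rebuild from the new .cl source.
    for bin_file in sorted(miner_dir.glob("*.bin")):
        bin_file.unlink()
        removed.append(bin_file.name)
    return removed
```

Call it once per kernel you want to replace, for example with `target_name="diamond.cl"`, then start the miner with -w 256.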

TWEAKING

Set intensity from 20 to 22. Thread concurrency and all the other parameters are useless.
This kernel doesn't make use of gpu ram, so set the ram clock to THE MINIMUM POSSIBLE VALUE; for example 150 MHz for R9 290.
Now play with the core clock until you find the highest stable value (probably between 1100 and 1200 for the R9 290).
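
As a rough sketch, the settings above can be collected into a command line. The flags used here (-k, -o, -u, -p, -I, -w, --gpu-engine, --gpu-memclock) are copied from a command posted later in this thread and have not been checked against any particular sgminer build; the function name, the defaults and the pool/user values are assumptions for illustration only.

```python
def build_sgminer_args(pool, user, password, intensity=21,
                       core_clock=1100, mem_clock=150):
    """Build an sgminer argument list following the tweaking advice.

    Hypothetical helper: the defaults mirror the R9 290 example
    (memory clock at 150 MHz, core clock somewhere in 1100-1200).
    """
    if not 20 <= intensity <= 22:
        raise ValueError("suggested intensity range is 20 to 22")
    return [
        "sgminer", "-k", "groestlcoin",
        "-o", pool, "-u", user, "-p", password,
        "-I", str(intensity),
        "-w", "256",  # worksize must be 256 for this kernel
        "--gpu-engine", str(core_clock),
        "--gpu-memclock", str(mem_clock),
    ]
```

Thread concurrency is left out on purpose, since it has no effect with this kernel.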

COMPATIBILITY

Tested and stable on R9 290, 280x and 7950. It should work on any recent AMD GPU, but performance is not guaranteed to be optimal.
It doesn't work with the cryptohunger optimized pool: use the conventional port or another pool. Also, replace the normal kernel of grs-sgminer, not the optimized one.

TROUBLESHOOTING

Try the following:
- Sure you set worksize to 256?
- Replace the generated .bin file with this one (64 bit, r9 280(x) and 290(x) only): LINK EXPIRED (diamondHawaiiw256l8.bin), see below for a newer binary file
- Lower the intensity
- Lower the core speed (are you sure you put the ram clock to the lowest possible value?)
- Since it uses more power, it could be a cooling issue too: check the gpu temperature

DONATIONS

This work took me months of coding and testing, and many sleepless nights; please show your appreciation (you are making more money by using it!) by donating to:
BTC: 1H7qC5uHuGX2d5s9Kuw3k7Wm7xMQzL16SN

DOWNLOAD

Opensource Kernel (v1):
https://app.box.com/s/9vikvemf7acio3uns7taorodqmod42ej

Experimental Hawaii bin (v2):
https://app.box.com/s/zsr29tfgv4tpxs1q7451dayzaw3wnoee

Wolf0's Tahiti bin (https://bitcointalk.org/index.php?topic=779598.msg11778971#msg11778971):
https://ottrbutt.com/miner/wolf-groestlcoinTahitigw256l4.bin

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 12, 2014, 08:19:21 AM
 #2

A BIT OF HISTORY

The first GPU miner for groestlcoin and similar coins was sph-sgminer by phm. Optimizing the original implementation was trivial (almost 3x the speed could be achieved!), so there are probably tens of optimized versions around, many of which have been kept private: mining groestlcoin and similar coins has always been unfair to most people, at least to non-devs.
Hopefully this kernel will end that, and it should also level the field between AMD and NVIDIA.
I believe my version is faster than many of the other kernels because of the time I dedicated to it and the thousands of tests I ran.

FINAL ADVICE

I suggest keeping "good binaries": make a backup of the fastest .bin files you have, so you can restore them in case of driver problems.
This can also gain you 1 or 2 percent more hashrate because of compiler variance (try removing the bin and recompiling three or four times to see the variance in action).
Or use the provided bin file (see the OP), which should be a good one.
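
A minimal sketch of the backup step, assuming the compiled .bin files sit in the miner's main folder. The function name `backup_bins`, the backup location and the label scheme are hypothetical.

```python
import shutil
from pathlib import Path


def backup_bins(miner_dir, backup_dir, label):
    """Copy every .bin in miner_dir into backup_dir/<label>/.

    label is free text, e.g. the hashrate the binary reached,
    so the fastest build can be found again later.
    Returns the names of the copied files.
    """
    dest = Path(backup_dir) / label
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for bin_file in sorted(Path(miner_dir).glob("*.bin")):
        shutil.copy2(bin_file, dest / bin_file.name)
        copied.append(bin_file.name)
    return copied
```

Labelling each backup with the measured hashrate makes it easy to keep only the best result of several recompiles.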

I've seen lower power usage with catalyst 14.9 compared to 14.6 beta (the gain is smaller compared to 13, but it is still better). Speaking of optimization, keep this in mind: buy a power meter for your miner(s)!
However, the compiler included in the 14.9 drivers seems to produce binaries that run considerably slower than those from older releases. If you are on 14.9, use the provided bin file instead of the kernel source in .cl format.

cryptonit
Legendary
*
Offline Offline

Activity: 1344


CVO Diamond Foundation (Visionary)


View Profile WWW
September 12, 2014, 08:20:05 AM
 #3

Thanks a lot for your effort to make the best possible AMD-based mining kernel available as open source for DMD Diamond.


popshot
Hero Member
*****
Offline Offline

Activity: 750


CEO Diamond Foundation


View Profile WWW
September 12, 2014, 08:40:05 AM
 #4

Pallas you are Prometheus, spending your time and skills in creating something useful to a lot of people and at the end opening it to all interested. Kudos  Smiley

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 12, 2014, 08:49:49 AM
 #5

That's what opensource is about ;-)
I'm a linux guy for 20 years now and I remember public domain software since the commodore age (around 1984).

srcxxx
Sr. Member
****
Offline Offline

Activity: 266


View Profile WWW
September 12, 2014, 08:53:53 AM
 #6

Hi pallas!

Good stuff!

Have you tried other groestl algorithms? Like byte slicing or bit slicing?

Did you also have strange situations where better looking code actually ran slower?

PS: I wish you had opened the code earlier, Groestlcoin even had a bounty for an optimized miner.

Thanks
srcxxx
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 12, 2014, 09:00:40 AM
 #7

Hi pallas!

Good stuff!

Have you tried other groestl algorithms? Like byte slicing or bit slicing?

Did you also have strange situations where better looking code actually ran slower?

PS: I wish you had opened the code earlier, Groestlcoin even had a bounty for an optimized miner.

Thanks
srcxxx

I did a lot of optimizations which looked smart but led to slower code. You can't imagine how many. Some of them I thought of while in bed, and they kept me awake until I could try them. Then the disappointment :-D
I believe it's because of the optimizations the compiler does, but most of all because of local memory and cache access.
If you optimize away some instructions but end up with less parallelism in memory access, you'll get slower code.
That's typical for groestl because its speed is limited mostly by memory (otherwise it would be as fast as keccak, blake etc.).
I know there was a bounty: my code was unfinished and, at the time, keeping it private was the only way I could mine without losing money.

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
September 12, 2014, 09:04:14 AM
 #8

Wow that's a nice improvement on hashrate Smiley  Now tuning for stability on my miners ...
Sending a donation your way next block find Smiley

Testing on HD7950 and R9280X and will report my hashrates when I get it stable Smiley
Both cards run considerably hotter and 100% fan ...
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 12, 2014, 09:07:18 AM
 #9

Wow that's a nice improvement on hashrate Smiley  Now tuning for stability on my miners ...
Sending a donation your way next block find Smiley

Thanks!
Let me know your figures.
I need 280x and 290x hashrates, to put in the op.

srcxxx
Sr. Member
****
Offline Offline

Activity: 266


View Profile WWW
September 12, 2014, 09:22:19 AM
 #10

Hi pallas!

Good stuff!

Have you tried other groestl algorithms? Like byte slicing or bit slicing?

Did you also have strange situations where better looking code actually ran slower?

PS: I wish you had opened the code earlier, Groestlcoin even had a bounty for an optimized miner.

Thanks
srcxxx

I did a lot of optimizations which looked smart but led to slower code. You can't imagine how many. Some of them I thought of while in bed, and they kept me awake until I could try them. Then the disappointment :-D
I believe it's because of the optimizations the compiler does, but most of all because of local memory and cache access.
If you optimize away some instructions but end up with less parallelism in memory access, you'll get slower code.
That's typical for groestl because its speed is limited mostly by memory (otherwise it would be as fast as keccak, blake etc.).
I know there was a bounty: my code was unfinished and, at the time, keeping it private was the only way I could mine without losing money.

I know. I actually think that the compiler is not that clever and that's why sometimes worse code runs faster.
Also, I looked at ASM and some stuff there is just plain not optimal. Perhaps it'll be improved in future versions of AMD drivers.

Also, most ASM code only uses .xy from a register. I tried making it work on ulong2 or ulong8 - only slower.

I wish it was possible to write GPU code in assembler...
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 12, 2014, 09:33:44 AM
 #11

Again, it's mostly about memory for groestl: optimizing register operations might lead to an unnoticeable gain, but you may lose on memory access.

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
September 12, 2014, 09:34:59 AM
 #12

@pallas
What's your DMD donation address Smiley Found 2 blocks in like 15 minutes (LUCK!)
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 12, 2014, 09:36:25 AM
 #13

@pallas
What's your DMD donation address Smiley Found 2 blocks in like 15 minutes (LUCK!)

good!
my DMD address is dVrz69vZFrxJRH9AnKyHim7Hd3PhY3w9NQ

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
September 12, 2014, 09:39:37 AM
 #14

Sent ya 0.5 DMD for now, will send some more after it runs stable for a day Smiley
Transaction ID: 37bca0a9872845908b4fc4e223d920b3355b5bbbb54de97a583aee67c7b4605d
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
September 12, 2014, 10:08:46 AM
 #15

I suppose I'll have to upgrade the RAM on my miner box from 2G to 4G now ... occasional 14.7RC2 driver crashes at I=22
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 12, 2014, 10:10:00 AM
 #16

I suppose I'll have to upgrade the RAM on my miner box from 2G to 4G now ... occasional 14.7RC2 driver crashes at I=22

does it work at I=21? there should be little difference in hashrate.

Ivanech
Hero Member
*****
Offline Offline

Activity: 786



View Profile WWW
September 12, 2014, 10:10:34 AM
 #17

Has anybody tried this with 270X cards? What hashrate should I expect?

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
September 12, 2014, 10:18:09 AM
 #18

I suppose I'll have to upgrade the RAM on my miner box from 2G to 4G now ... occasional 14.7RC2 driver crashes at I=22

does it work at I=21? there should be little difference in hashrate.

Actually runs faster at I=21 Smiley
Have not messed with GPU or MEM clocks just defaults Smiley  
(Powercolor) 280X  18MHs 67C-68C
(Powercolor) 7950  16MHs 68C-69C
Both cards are volt-modded to lower than stock ...
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 12, 2014, 10:48:25 AM
 #19

I suppose I'll have to upgrade the RAM on my miner box from 2G to 4G now ... occasional 14.7RC2 driver crashes at I=22

does it work at I=21? there should be little difference in hashrate.

Actually runs faster at I=21 Smiley
Have not messed with GPU or MEM clocks just defaults Smiley  
(Powercolor) 280X  18MHs 67C-68C
(Powercolor) 7950  16MHs 68C-69C
Both cards are volt-modded to lower than stock ...

good, but if you lower the mem clock you will save power, and get higher maximum core clock as well.

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 12, 2014, 10:56:02 AM
 #20

added to the op:

IF IN TROUBLE, TRY REPLACING THE GENERATED .BIN FILE WITH THIS ONE: https://dl.dropboxusercontent.com/u/40353042/Diamond/diamondHawaiiw256l8.bin

echo00114
Member
**
Offline Offline

Activity: 63


View Profile
September 12, 2014, 12:42:42 PM
 #21

Has anybody tried this with 270X cards? What hashrate should I expect?

hello
I tried a Sapphire Toxic but it was no good: all HW errors. I tried lowering the GPU clock but it did not work; it stays at 1150 MHz. I use grs-sgminer 1.4 with the AMD 14.6 RC driver.

bye
update 1:
I tried a different miner, sgminerGroestl4_1_0_1, and this .cl works great: 9.7 MH/s at I 22 and 9.5 MH/s at I 20.

bye
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 12, 2014, 12:45:52 PM
 #22

Has anybody tried this with 270X cards? What hashrate should I expect?

hello
I tried a Sapphire Toxic but it was no good: all HW errors. I tried lowering the GPU clock but it did not work; it stays at 1150 MHz. I use grs-sgminer 1.4 with the AMD 14.6 RC driver.

bye

since I've never tested it on 270x there could be issues, but please try:

- using the bin file from the OP
- lowering the intensity
- lowering core speed (are you sure you put the ram clock to the lowest possible value?)

since it uses more power, it could be a cooling issue too.

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
September 12, 2014, 12:59:07 PM
 #23

Has anybody tried this with 270X cards? What hashrate should I expect?

hello
I tried a Sapphire Toxic but it was no good: all HW errors. I tried lowering the GPU clock but it did not work; it stays at 1150 MHz. I use grs-sgminer 1.4 with the AMD 14.6 RC driver.

bye

since I've never tested it on 270x there could be issues, but please try:

- using the bin file from the OP
- lowering the intensity
- lowering core speed (are you sure you put the ram clock to the lowest possible value?)

since it uses more power, it could be a cooling issue too.
If he is using grs-sgminer then he must be using cryptohunger pool which would be incompatible with
a real groestlcoin kernel ... have to use sph-sgminer, not proprietary grs-sgminer with this kernel Smiley
cryptohunger requires grs-sgminer, so move to a different pool if you want to try this kernel or just solo mine like I do Smiley
Litejavichu
Hero Member
*****
Offline Offline

Activity: 675



View Profile
September 12, 2014, 01:21:15 PM
 #24

The version I proposed works perfectly with 13.2. Go to the control panel, uninstall 14 and then install driver 13.2.

7950=13Mhs I 21
7970=14-15mhs I21
But these optimized versions make the GPU run hotter and use more electricity...
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 12, 2014, 01:27:48 PM
 #25

The version I proposed works perfectly with 13.2. Go to the control panel, uninstall 14 and then install driver 13.2.

7950=13Mhs I 21
7970=14-15mhs I21
But these optimized versions make the GPU run hotter and use more electricity...

fine, thanks, but my kernel is faster...

cryptonit
Legendary
*
Offline Offline

Activity: 1344


CVO Diamond Foundation (Visionary)


View Profile WWW
September 12, 2014, 01:40:08 PM
 #26

21 MH/s on an AMD 290 at 1020 GPU clock
pretty power hungry (dual 290s use around 150 W more on this than on x11)
but it's the best groestl AMD miner I have ever seen regarding hashrate per card
great job!

I still use http://multipool.bit.diamonds/ (able to mine DMD with nearly any existing algo)
AMD cards run x11 most of the time and NVIDIA cards nist5
because at Austrian power costs I need to be very sensitive to power usage

stillontop
Member
**
Offline Offline

Activity: 88


View Profile
September 12, 2014, 09:28:06 PM
 #27

Pallas, you are great!

Here are my numbers:
290x @ 1125 Mhz ~ 26,4 Mh/s
290x @ 1040 Mhz ~ 24,4 Mh/s

I am going to send you 20% of the next ten blocks I will find,
just because I want to donate to an honest and sharing person.


To use the bin file correctly, you have to remove all bin files in the main folder, download pallas bin file, put it into the main folder,
start your miner, copy the filename of your automatically created bin file after you started the miner. Then you stop the miner and rename pallas bin-file with the copied filename.
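
The rename step described above can be sketched as follows, assuming the miner has already been started once (so it has generated its own .bin) and is now stopped. The function name `adopt_provided_bin` is hypothetical, and the sketch expects exactly one generated .bin in the main folder.

```python
from pathlib import Path


def adopt_provided_bin(miner_dir, provided_bin):
    """Give provided_bin the filename of the miner-generated .bin.

    The generated file is overwritten, so the miner loads the
    provided binary on the next start. Returns the adopted filename.
    """
    miner_dir = Path(miner_dir)
    provided_bin = Path(provided_bin)
    # The provided file may sit in the same folder, so skip it here.
    generated = [p for p in miner_dir.glob("*.bin")
                 if p.resolve() != provided_bin.resolve()]
    if len(generated) != 1:
        raise RuntimeError("expected exactly one generated .bin file")
    target = generated[0]
    provided_bin.replace(target)  # rename, overwriting the generated file
    return target.name
```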

DMD-Address: dQxRzzz1Ae8J46V7KGhvSrbngTodtp5fp7
srcxxx
Sr. Member
****
Offline Offline

Activity: 266


View Profile WWW
September 12, 2014, 09:56:16 PM
 #28

Hi guys!

I am not sure if you understand it yet, but your profits are not going to increase. Unfortunately. :-(

Perhaps in the upcoming 5-10 days, but then it'll all be as before. And you'll be paying more for electricity.

That is one of the reasons I did not open source my miner.

Cause not only you are going to move to the faster kernel. Everybody will move to the new kernel.
So not just your hashrate will increase, everybody's hashrate will increase as well.

And so, the total coins distribution between miners will not change, but you're going to pay more to electricity companies.

The new kernel eats a lot more electricity and makes more heat.

Which is bad for the algo we're mining.
Also it makes CPU mining a lot less profitable compared with GPU mining. Also bad.
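
The argument above can be checked with a small calculation. The numbers are made up for illustration, and the model is a simplification: it assumes a miner's long-run share of the coins equals its share of the total network hashrate.

```python
import math


def reward_share(my_hashrate, network_hashrate):
    """Fraction of the coins a miner earns in the long run,
    assuming rewards are proportional to hashrate."""
    return my_hashrate / network_hashrate


# Hypothetical figures: one 18 MH/s card on a 1800 MH/s network.
before = reward_share(18.0, 1800.0)

# Everybody switches to a kernel that is 1.5x faster.
speedup = 1.5
after = reward_share(18.0 * speedup, 1800.0 * speedup)

# The speedup cancels out, so the share is unchanged.
same_share = math.isclose(before, after)
```

Under these assumptions `same_share` is true: the coins are split exactly as before, while every card draws more power.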

The root of all evil,
srcxxx
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 12, 2014, 10:20:01 PM
 #29

Hi guys!

I am not sure if you understand it yet, but your profits are not going to increase. Unfortunately. :-(

Perhaps in the upcoming 5-10 days, but then it'll all be as before. And you'll be paying more for electricity.

That is one of the reasons I did not open source my miner.

Cause not only you are going to move to the faster kernel. Everybody will move to the new kernel.
So not just your hashrate will increase, everybody's hashrate will increase as well.

And so, the total coins distribution between miners will not change, but you're going to pay more to electricity companies.

The new kernel eats a lot more electricity and makes more heat.

Which is bad for the algo we're mining.
Also it makes CPU mining a lot less profitable compared with GPU mining. Also bad.

The root of all evil,
srcxxx

hmmm not exactly.
the biggest miners are already using optimized miners made by themselves; this kernel puts the common people on a higher level so they can compete.
smaller miners will make more coins and biggest ones less.
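
Here is the same kind of calculation for this scenario, again with made-up numbers and the assumption that rewards are proportional to hashrate. Only the small miners speed up, because the big miners are assumed to be running optimized kernels already.

```python
def share_after_release(my_rate, other_small_rate, big_rate, speedup):
    """Return (share_before, share_after) for one small miner.

    my_rate and other_small_rate are sped up by the public kernel;
    big_rate stays the same (already optimized).
    """
    before = my_rate / (my_rate + other_small_rate + big_rate)
    after = (my_rate * speedup) / (
        (my_rate + other_small_rate) * speedup + big_rate
    )
    return before, after


# Hypothetical network: small miners 500 MH/s in total (10 of it mine),
# big miners 500 MH/s, public kernel 1.5x faster.
before, after = share_after_release(10.0, 490.0, 500.0, 1.5)
```

With these figures the small miner's share grows, and the big miners' combined share shrinks from 500/1000 to 500/1250 of the network.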

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 12, 2014, 10:21:02 PM
 #30

Pallas, you are great!

Here are my numbers:
290x @ 1125 Mhz ~ 26,4 Mh/s
290x @ 1040 Mhz ~ 24,4 Mh/s

I am going to send you 20% of the next ten blocks I will find,
just because I want to donate to an honest and sharing person.


To use the bin file correctly, you have to remove all bin files in the main folder, download pallas bin file, put it into the main folder,
start your miner, copy the filename of your automatically created bin file after you started the miner. Then you stop the miner and rename pallas bin-file with the copied filename.

thanks very much! :-)

lecaillou
Newbie
*
Offline Offline

Activity: 11


View Profile
September 12, 2014, 11:09:25 PM
 #31

Hi, sorry but I don't see where I can download the latest sph-sgminer?
Thanks
Séb
stillontop
Member
**
Offline Offline

Activity: 88


View Profile
September 12, 2014, 11:20:57 PM
 #32

Maybe it would be a wise decision to use a power saver kernel in summer, to enhance your graphics card lifetime and a more power version in winter to keep your feet warm Smiley

DMD-Address: dQxRzzz1Ae8J46V7KGhvSrbngTodtp5fp7
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 13, 2014, 08:01:14 AM
 #33

Maybe it would be a wise decision to use a power saver kernel in summer, to enhance your graphics card lifetime and a more power version in winter to keep your feet warm Smiley

Or just lower the core clock ;-)

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 13, 2014, 08:02:29 AM
 #34

Hi, sorry but I don't see where I can download the latest sph-sgminer?
Thanks
Séb

You can use the "classic" version as well as the new version 5 (and probably other similar miners).

Ivanech
Hero Member
*****
Offline Offline

Activity: 786



View Profile WWW
September 13, 2014, 09:24:05 AM
 #35

Excellent work!

Got about 15 MH/s for my Gigabyte 270X compared to 13 MH/s previously.

Is it possible to release other optimized kernels? Qubitcoin for example?

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 13, 2014, 10:30:26 AM
 #36

Excellent work!

Got about 15 MH/s for my Gigabyte 270X compared to 13 MH/s previously.

Is it possible to release other optimized kernels? Qubitcoin for example?

Yes some improvements can be made but I have no time to dedicate to it :-)

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 14, 2014, 12:25:09 PM
 #37

Can someone please post a 32 bit Hawaii or Tahiti binary?
The filename should end in "256l4.bin".

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
September 14, 2014, 10:53:02 PM
 #38

has anyone tested this with sgminer v5?  I am currently running sph-sgminer 4.1.0
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 16, 2014, 09:35:23 AM
 #39

if someone is still getting hardware errors, ensure you are using a worksize of 256 (-w 256 on the commandline).
some optimizations depend on it so don't set it to a lower value.
also don't set it to a higher value simply because 256 is the maximum and that value will be used instead.

HR
Legendary
*
Offline Offline

Activity: 1036


Transparency & Integrity


View Profile
September 20, 2014, 08:53:10 AM
 #40


Congratulations Pallas.

Do you have it on Github already?


If there is only 4 times more LTC than BTC, when is LTC going to be worth 1/4 of BTC?
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 20, 2014, 09:49:22 AM
 #41


Congratulations Pallas.

Do you have it on Github already?

Thanks.
No git yet. I was thinking it's not necessary for a single file, but now there are bins and maybe multiple versions so I may end up doing it.

qwep
Legendary
*
Offline Offline

Activity: 1008



View Profile
September 20, 2014, 06:58:13 PM
 #42


Congratulations Pallas.

Do you have it on Github already?

Thanks.
No git yet. I was thinking it's not necessary for a single file, but now there are bins and maybe multiple versions so I may end up doing it.
congratulations excellent optimization, but what about the other algorithms
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 20, 2014, 08:02:24 PM
 #43


Congratulations Pallas.

Do you have it on Github already?

Thanks.
No git yet. I was thinking it's not necessary for a single file, but now there are bins and maybe multiple versions so I may end up doing it.
congratulations excellent optimization, but what about the other algorithms

I did work on other algorithms too (x11 components, m7 and others) but nothing ready for publication.
I'd need more time and lower kilowatt hour cost in order to go ahead :-)

qwep
Legendary
*
Offline Offline

Activity: 1008



View Profile
September 20, 2014, 08:49:54 PM
 #44


Congratulations Pallas.

Do you have it on Github already?

Thanks.
No git yet. I was thinking it's not necessary for a single file, but now there are bins and maybe multiple versions so I may end up doing it.
congratulations excellent optimization, but what about the other algorithms

I did work on other algorithms too (x11 components, m7 and others) but nothing ready for publication.
I'd need more time and lower kilowatt hour cost in order to go ahead :-)
I understand that all miners are written in C++. Why not in C#? It is a bit faster than C++.
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 20, 2014, 08:57:27 PM
 #45


Congratulations Pallas.

Do you have it on Github already?

Thanks.
No git yet. I was thinking it's not necessary for a single file, but now there are bins and maybe multiple versions so I may end up doing it.
congratulations excellent optimization, but what about the other algorithms

I did work on other algorithms too (x11 components, m7 and others) but nothing ready for publication.
I'd need more time and lower kilowatt hour cost in order to go ahead :-)
I understand that all miners are written in C++. Why not in C#? It is a bit faster than C++.

Most miners are written in plain C but it doesn't matter that much unless you are mining with the CPU (still there is a good deal of assembly on some algorithms). GPU code is opencl or cuda instead.

smolen
Hero Member
*****
Offline Offline

Activity: 525


View Profile
September 24, 2014, 12:09:06 AM
 #46

Take a look at my groestl implementation Wink

Of course I gave you bad advice. Good one is way out of your price range.
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 24, 2014, 07:06:02 AM
 #47


Thanks. From a first look, I don't see anything I haven't tried yet :-)
Do you have some hashrate figures?

smolen
Hero Member
*****
Offline Offline

Activity: 525


View Profile
September 24, 2014, 05:13:54 PM
 #48


Thanks. From a first look, I don't see anything I haven't tried yet :-)
Do you have some hashrate figures?
Sorry, no testing results. I'm away from all crypto stuff; that's a rather abandoned project, collecting virtual dust on my HDD...
I didn't quite grok all your tricks Smiley I only use 3 arrays of 32 integers for intermediate results, so memory usage should be almost minimal, and such buffer reuse could be an independent optimization. I'm quite sure you have tried the rest Smiley

Of course I gave you bad advice. Good one is way out of your price range.
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 24, 2014, 05:58:15 PM
 #49

I did the best I could to reduce register usage, but in the end using more 64 bits turned out to be faster. Probably private memory is not fully used...

smolen
Hero Member
*****
Offline Offline

Activity: 525


View Profile
September 24, 2014, 06:29:22 PM
 #50

in the end using more 64 bits turned out to be faster. Probably private memory is not fully used...
Maybe the 64-bit math steered the AMD OpenCL compiler away from useless 'optimizations' Smiley I once saw a strange effect where inserting absolutely unrelated operations (well, it was copy protection) in the middle of heavy number crunching resulted in a ~5% speed increase.

Of course I gave you bad advice. Good one is way out of your price range.
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
September 24, 2014, 06:58:55 PM
 #51

@pallas
Any chance of you integrating your groestl kernel into the optimised X11 and X13 kernels on my BCT thread (in my sig) Smiley
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 24, 2014, 08:18:21 PM
 #52

@pallas
Any chance of you integrating your groestl kernel into the optimised X11 and X13 kernels on my BCT thread (in my sig) Smiley

I'll have a look asap. Actually I don't have much free time now, so it may take a bit.

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 24, 2014, 08:21:07 PM
 #53

in the end using more 64 bits turned out to be faster. Probably private memory is not fully used...
Maybe the 64-bit math steered the AMD OpenCL compiler away from useless 'optimizations' Smiley I once saw a strange effect where inserting absolutely unrelated operations (well, it was copy protection) in the middle of heavy number crunching resulted in a ~5% speed increase.

Sometimes it looks random indeed! :-D
And sometimes compiling the same .cl file leads to different hashrates O_o

istvandv
Sr. Member
****
Offline Offline

Activity: 339


View Profile
September 25, 2014, 04:58:45 AM
 #54

@pallas
Any chance of you integrating your groestl kernel into the optimised X11 and X13 kernels on my BCT thread (in my sig) Smiley

I'll have a look asap. Actually I don't have much free time now, so it may take a bit.

while you are at it, how about myr-groestl?  Grin

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 25, 2014, 07:28:17 AM
 #55

@pallas
Any chance of you integrating your groestl kernel into the optimised X11 and X13 kernels on my BCT thread (in my sig) Smiley

I'll have a look asap. Actually I don't have much free time now, so it may take a bit.

while you are at it, how about myr-groestl?  Grin

I had a quick look some time ago: some of the tricks that work on this kernel do not make myr-groestl any faster, thus I'd need to re-tune it from scratch... :-/

Spider07
Full Member
***
Offline Offline

Activity: 134


View Profile
September 28, 2014, 07:08:43 AM
 #56

Hello
How do you get such a result? I only get 6.
Are my settings wrong?

sgminer.exe -k groestlcoin -o localhost:17772 -u XXXX -p XXXXXXXX -I 22 -w 256 -g 1 --thread-concurrency 24000 --gpu-engine 1100 --gpu-memclock 1250


Thanks
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 28, 2014, 07:26:05 AM
 #57

Hello
How do you get such a result? I only get 6.
Are my settings wrong?

sgminer.exe -k groestlcoin -o localhost:17772 -u XXXX -p XXXXXXXX -I 22 -w 256 -g 1 --thread-concurrency 24000 --gpu-engine 1100 --gpu-memclock 1250


Thanks

See the troubleshooting on the op.
Lower your memory clock, try the compiled binary.

Spider07
Full Member
***
Offline Offline

Activity: 134


View Profile
September 28, 2014, 07:33:57 AM
 #58

Oh
I tried the compiled binary, with no more success...
I pasted your binary into my folder, copied the name of my generated .bin, deleted my .bin, renamed your .bin to that name, and ran my .bat again.
Is this the right way to proceed?

I also changed to --gpu-memclock 350, without any success.
Thanks
losk22
Legendary
*
Offline Offline

Activity: 1134



View Profile
September 28, 2014, 09:24:09 AM
 #59

Replace groestlcoin.cl, diamond.cl and/or the kernel you want to use with this one (it's inside the "kernel" folder)

Can you read more?

Spider07
Full Member
***
Offline Offline

Activity: 134


View Profile
September 28, 2014, 10:52:27 AM
 #60

Replace groestlcoin.cl, diamond.cl and/or the kernel you want to use with this one (it's inside the "kernel" folder) You can read more?

Thanks for your help.

Sorry, I forgot to mention that I have done that.
To be sure, I deleted all the files in the kernel folder.
I have only 1 file (groestlcoin-v1.cl renamed to groestlcoin.cl).

I can't reach 16 Mh/s... :'( only 6... :'(
losk22
Legendary
*
Offline Offline

Activity: 1134



View Profile
September 29, 2014, 07:39:45 AM
 #61

ABOUT

This is my optimized Groestlcoin / Diamond and similar opencl kernel (groestl + groestl algorithm, not myriad-groestl which is groestl + sha).
It is based on the sph version originally available on sph-sgminer but is now totally rewritten.
It should be compatible with all sph-sgminer versions and derivatives.

PERFORMANCE

R9 290x @1125 Mhz: ~26.4 Mh/s
R9 290 @1200: ~25 Mh/s
R9 280x (stock): ~18 Mh/s
7950 @1200: ~16 Mh/s
R9 270X: ~9.7 Mh/s

HOW TO USE

- Stop the miner
- Replace groestlcoin.cl, diamond.cl and/or the kernel you want to use with this one (it's inside the "kernel" folder)
- Remove all the .bin files (in the main folder)
- Set worksize to 256 only (-w 256)
- Run and enjoy!
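
The "remove all the .bin files" step is easy to script. A minimal Python sketch, assuming the miner is already stopped; `miner_dir` is a hypothetical name for the miner's main folder, so point it at your own install:

```python
from pathlib import Path


def remove_cached_bins(miner_dir):
    """Delete cached kernel binaries (*.bin) from the miner's main folder.

    miner_dir is an assumption: use the folder where your miner lives.
    Returns the names of the files that were removed.
    """
    removed = []
    for bin_file in sorted(Path(miner_dir).glob("*.bin")):
        bin_file.unlink()  # the miner rebuilds the binary from the .cl on next start
        removed.append(bin_file.name)
    return removed
```

Only files directly inside the main folder are touched; the .cl sources in the "kernel" folder are left alone.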

TWEAKING

Set intensity from 20 to 22. Thread concurrency and all the other parameters are useless.
This kernel doesn't make use of gpu ram, so set the ram clock to THE MINIMUM POSSIBLE VALUE; for example 150 MHz for R9 290.
Now play with the core clock until you find the highest stable value (probably between 1100 and 1200 for the R9 290).

COMPATIBILITY

Tested working stable on R9 290, 280x and 7950. Should work on any recent amd gpu but performance is not guaranteed to be optimal.
It doesn't work with the cryptohunger optimized pool: use the conventional port or another pool. Also, do not replace the optimized kernel of grs-sgminer but the normal one.

TROUBLESHOOTING

Try the following:
- Sure you set worksize to 256?
- Replace the generated .bin file with this one (64 bit only): https://dl.dropboxusercontent.com/u/40353042/Diamond/diamondHawaiiw256l8.bin
- Lower the intensity
- Lower the core speed (are you sure you put the ram clock to the lowest possible value?)
- Since it uses more power, it could be a cooling issue too: check the gpu temperature

DONATIONS

This work took me months of coding and testing and sleepless nights; please show your appreciation (you are making more money by using it!) by donating to:
BTC: 1H7qC5uHuGX2d5s9Kuw3k7Wm7xMQzL16SN
DMD: dVrz69vZFrxJRH9AnKyHim7Hd3PhY3w9NQ

DOWNLOAD

https://dl.dropboxusercontent.com/u/40353042/Diamond/groestlcoin-v1.cl

What coins can be extracted with your algorithm?

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 29, 2014, 08:14:34 AM
 #62

What coins can be extracted with your algorithm?

the ones using the groestlcoin algo (groestl + groestl): for example diamond, aidbit and atheistcoin.
not myriad groestl like saffroncoin and digibyte.

St.Neman
Jr. Member
*
Offline Offline

Activity: 30


View Profile WWW
September 30, 2014, 08:42:47 PM
 #63

4gpu 280x = 69 Mh/s = about 7000 AID x 0.00000086 = 0.00602 BTC/day

(power consumption 870 W)

0.00602 BTC = $2.35/day
electricity = $1.12/day

$2.35 - $1.12 = $1.23/day
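
The same arithmetic as a small Python sketch, in case anyone wants to plug in their own numbers. The BTC price and the electricity rate are not given in the post; the values used in the example (about $390 per BTC and about $0.0536 per kWh) are back-derived from the quoted figures and are assumptions only.

```python
def daily_profit_usd(coins_per_day, coin_price_btc, btc_price_usd,
                     power_watts, electricity_usd_per_kwh):
    """Return (revenue_btc, revenue_usd, cost_usd, profit_usd) for one day."""
    revenue_btc = coins_per_day * coin_price_btc
    revenue_usd = revenue_btc * btc_price_usd
    energy_kwh = power_watts * 24 / 1000  # watts over 24 hours, in kWh
    cost_usd = energy_kwh * electricity_usd_per_kwh
    return revenue_btc, revenue_usd, cost_usd, revenue_usd - cost_usd


# Figures from the post; btc_price_usd and electricity_usd_per_kwh are assumed.
revenue_btc, revenue_usd, cost_usd, profit_usd = daily_profit_usd(
    coins_per_day=7000,
    coin_price_btc=0.00000086,
    btc_price_usd=390.0,
    power_watts=870,
    electricity_usd_per_kwh=0.0536,
)
```

With these inputs the results come out close to the figures quoted above. The electricity rate is usually the input that decides whether the rig is profitable at all.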

AIDBIT
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
September 30, 2014, 09:20:58 PM
 #64

4gpu 280x = 69Mhs = about 7000AID x 0.00000086 = 0.00602 BTC/daily

(power consumption 870W/h)

0,00602BTC = 2,35$/daily
electricity = 1,12$/daily

2,35$ - 1,12$ = 1,23$/daily

That's good! You pay very little for your electricity...
I pay a lot more, so it hasn't been profitable for me for a long time now.

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
September 30, 2014, 11:54:51 PM
 #65

Interesting observation on the 280x: I had to raise the GPU vdc in the bios to get stable above 18.4 Mh/s (now 1.118v; it was stable at 1.087v using the previous kernel).  Best for me is 1160 gpu / 1500 mem, threads 1, TC 24576 (it does make a difference).  Sadly, temps rise dramatically with a minor increase in voltage, and I have capped my overclock attempts as over 75C is enough for me :(  I have been unable to lower the mem clock, it seems to be locked (Powercolor R9 280x).  Cooler ambient temps are coming but I will not gain all that much, as temps seem to rise exponentially with overclocking (nice heater LOL).
Congrats on profitability St.Neman. I am barely profitable with the increased power requirement, but the 1 DMD reward window will end soon enough ... take advantage while you can.  I've got a new (used) mobo on the way and will soon get the main miner back online (only 1 280x atm, in the entertainment box; the main miner mobo died in a puff of smoke LOL).
You should be able to push 73+ Mh/s with 4 cards ... (volt mod and proper heatsink paste application).
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
October 01, 2014, 12:18:13 AM
 #66

@pallas
have you looked into incorporating super optimized groestl into the X11mod kernel on my thread (in sig)?
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
October 01, 2014, 07:39:50 AM
 #67

Interesting observation 280x, I had to raise GPU vdc in bios to get stable above 18.4MHs (now 1.118v) ... (was stable at 1.087v using previous kernel).  Best for me 1160gpu/1500mem threads 1, TC 24576 (it does make a difference).  Temps rise dramatically with minor increase in voltage sadly and have capped my overclock attempts as over 75c is enough for me Sad  I have been unable to lower mem clock it seems to be locked (powercolor R9 280x).  Cooler ambient temps coming but will not gain all that much as temps seem to rise exponentially with overclocking (nice heater LOL).
Congrats on profitibility St.Neman I am barely profitable with increased power requirement but the 1 DMD reward window will end soon enough ... take advantage while you can.  Got new (used) Mobo on the way and will soon get main miner back online (only 1 280x atm in entertainment box right now main miner mobo died in a puff of smoke LOL).
You should be able to push 73+ Mhs with 4 cards ... (volt mod and proper heatsink paste application).

It's a pity you can't reduce mem clock because it would save you watts and lower the card temperature (and also let you overclock the core more).
I think the 280x has a maximum core/ram clock range like the 7950: I used to set it at 1150 core and 1000 ram.
If the hashrate changes with thread concurrency, it's probably just compiler variance and not due to the TC buffer which is not used at all. Try removing the bins and restart a handful of times, you'll probably see different hashrates with the exact same settings.

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
October 01, 2014, 04:21:10 PM
 #68

Researching inability to lower mem clock, I use VBE7.0.0.7b.exe and atiwinflash to do the volt mod on 280x and 7950.  
Going to try force mem clock in bios ... I get worried about flashing card so many times but it's worth a try ...
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
October 01, 2014, 04:33:01 PM
 #69

OK that worked I was able to lower mem clock to 500, trying further OC now ...

Temp dropped from 76-77C to 72C with current settings (1170 GPU 500 MEM) 18.5MHs after 10 min ... will probably lower memclk to 150 like on OP Smiley

Lowered memclk to 150 by bios mod and gpu to 1.112V

Temp 71C after 25 min at 1171/150 18.48Mhs Smiley I'm happy with that and will leave it alone for a week to see if stable Smiley

Display driver crash after 30 min, lowering clock by 1 ...

NOTE: VBios modding is not for the faint of heart, this would likely destroy gaming performance, use only on dedicated mining box Smiley

Question for more experienced miners ... what would I=22 be as xIntensity on r9 280x? (2048 shaders)
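
On the I vs xIntensity question, a rough Python sketch. It assumes the usual convention that intensity I launches 2^I work items and that xIntensity launches shaders x X work items; some miner builds apply per-algorithm shifts on top of this, so treat the result as a starting point only. The function names are made up for illustration.

```python
def threads_from_intensity(intensity):
    """Work items launched for -I <intensity>, assuming threads = 2^I."""
    return 2 ** intensity


def equivalent_xintensity(intensity, shaders):
    """xIntensity giving the same thread count, assuming threads = shaders * X."""
    return threads_from_intensity(intensity) / shaders


# R9 280x with 2048 shaders at I=22
x = equivalent_xintensity(22, 2048)
```

Under those assumptions, I=22 on a card with 2048 shaders works out to an xIntensity of 2048.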
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
October 01, 2014, 06:01:30 PM
 #70

mmmm ... wonderful odor, I am steam curing a pound of virginia tobacco in oven for my own version of dark molasses cavendish ... nice rich smooth smoke and it's getting really dark after 8 hr sample.

I'm wondering what it has to do with this thread.
Still, thanks for the bump ;-)

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
October 01, 2014, 06:06:20 PM
 #71

I'm getting query from another user on my thread on status of x11mod ... any progress?
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
October 02, 2014, 07:55:49 AM
 #72

I've found it is needed to keep GPU temp below 72C with the powercolor R9 280x, GPU clock gets throttled down >= 72C ...
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
October 02, 2014, 08:00:09 AM
 #73

I've found it is needed to keep GPU temp below 72C with the powercolor R9 280x, GPU clock gets throttled down >= 72C ...

72C looks pretty low to me... my 290s throttle at 95 and 85.

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
October 02, 2014, 08:26:45 AM
 #74

I've found it is needed to keep GPU temp below 72C with the powercolor R9 280x, GPU clock gets throttled down >= 72C ...

72C looks pretty low to me... my 290s throttles at 95 and 85.
Not a real concern for me with groestl (running under 72C) but running X11 it is an issue when I OC to 1080/1650.

Might have something to do with fan profiles in vbios ... have not messed with that yet.

Just checked status on my ebay order of replacement motherboard for mining box (MOBO/RAM/CPU - Asrock H81 PRO BTC/G3220 3GHZ/4GB 1600 DDR3) and it has reached post office here in SLC Smiley should get delivered tomorrow or next day Smiley then I can bring more cards online again Smiley~

R9 280x stable at 18.47 Mhs temp 71C (1170/150) after 15+ hours Smiley
St.Neman
Jr. Member
*
Offline Offline

Activity: 30


View Profile WWW
October 02, 2014, 06:52:23 PM
 #75

Interesting observation 280x, I had to raise GPU vdc in bios to get stable above 18.4MHs (now 1.118v) ... (was stable at 1.087v using previous kernel).  Best for me 1160gpu/1500mem threads 1, TC 24576 (it does make a difference).  Temps rise dramatically with minor increase in voltage sadly and have capped my overclock attempts as over 75c is enough for me Sad  I have been unable to lower mem clock it seems to be locked (powercolor R9 280x).  Cooler ambient temps coming but will not gain all that much as temps seem to rise exponentially with overclocking (nice heater LOL).
Congrats on profitibility St.Neman I am barely profitable with increased power requirement but the 1 DMD reward window will end soon enough ... take advantage while you can.  Got new (used) Mobo on the way and will soon get main miner back online (only 1 280x atm in entertainment box right now main miner mobo died in a puff of smoke LOL).
You should be able to push 73+ Mhs with 4 cards ... (volt mod and proper heatsink paste application).

thanks utahjohn,
gpu 1090, mem 1000, I 22, 1.169V (temp 70C, fan 2000-2500)
I can't get more, I try but can't :(
I have the worst 280x gpus: http://www.sapphiretech.com/presentation/product/product_index.aspx?pid=2022&lid=1
New gpus simply stop working after a few months,
then I have to change every capacitor on the gpu
if I want it to work again.

AIDBIT
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
October 02, 2014, 07:43:17 PM
 #76

Interesting observation 280x, I had to raise GPU vdc in bios to get stable above 18.4MHs (now 1.118v) ... (was stable at 1.087v using previous kernel).  Best for me 1160gpu/1500mem threads 1, TC 24576 (it does make a difference).  Temps rise dramatically with minor increase in voltage sadly and have capped my overclock attempts as over 75c is enough for me Sad  I have been unable to lower mem clock it seems to be locked (powercolor R9 280x).  Cooler ambient temps coming but will not gain all that much as temps seem to rise exponentially with overclocking (nice heater LOL).
Congrats on profitibility St.Neman I am barely profitable with increased power requirement but the 1 DMD reward window will end soon enough ... take advantage while you can.  Got new (used) Mobo on the way and will soon get main miner back online (only 1 280x atm in entertainment box right now main miner mobo died in a puff of smoke LOL).
You should be able to push 73+ Mhs with 4 cards ... (volt mod and proper heatsink paste application).

thanks utahjohn,
gpu1090, mem1000, i22, 1,169V (temp 70C fan 2000-2500)
can't get more, i try but can't Sad
i have worst gpus 280x http://www.sapphiretech.com/presentation/product/product_index.aspx?pid=2022&lid=1,
new gpus after a few months simply stop working,
then i must to change every capacitor on gpu
if i want to work again
Are you tuning each card individually? There is great variance in GPU quality between cards, so try to tune them one by one in your config.
Smoked capacitors??? Is your power unstable or your power supply faulty? I have never had a blown capacitor.
St.Neman
Jr. Member
*
Offline Offline

Activity: 30


View Profile WWW
October 02, 2014, 10:06:00 PM
 #77

Interesting observation 280x, I had to raise GPU vdc in bios to get stable above 18.4MHs (now 1.118v) ... (was stable at 1.087v using previous kernel).  Best for me 1160gpu/1500mem threads 1, TC 24576 (it does make a difference).  Temps rise dramatically with minor increase in voltage sadly and have capped my overclock attempts as over 75c is enough for me Sad  I have been unable to lower mem clock it seems to be locked (powercolor R9 280x).  Cooler ambient temps coming but will not gain all that much as temps seem to rise exponentially with overclocking (nice heater LOL).
Congrats on profitibility St.Neman I am barely profitable with increased power requirement but the 1 DMD reward window will end soon enough ... take advantage while you can.  Got new (used) Mobo on the way and will soon get main miner back online (only 1 280x atm in entertainment box right now main miner mobo died in a puff of smoke LOL).
You should be able to push 73+ Mhs with 4 cards ... (volt mod and proper heatsink paste application).

thanks utahjohn,
gpu1090, mem1000, i22, 1,169V (temp 70C fan 2000-2500)
can't get more, i try but can't Sad
i have worst gpus 280x http://www.sapphiretech.com/presentation/product/product_index.aspx?pid=2022&lid=1,
new gpus after a few months simply stop working,
then i must to change every capacitor on gpu
if i want to work again
Are you tuning each card individually (there is great varience between GPU quality on each card), try to tune them one by one in your config. 
Smoked capacitors Huh is your power unstable or PS faulty, I have never had a blown capacitor.


My equipment and power are absolutely OK and my other gpus are working perfectly; bad Chinese capacitors are my one big problem :(

AIDBIT
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
October 03, 2014, 11:26:22 AM
 #78

@st.neman
Tried installing catalyst 14.9 and it totally borked hashrate using super kernel. 
Stick with catalyst 14.7 RC3 Smiley
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
October 03, 2014, 11:30:37 AM
 #79

@st.neman
Tried installing catalyst 14.9 and it totally borked hashrate using super kernel. 
Stick with catalyst 14.7 RC3 Smiley

I've experienced the hashrate drop on the new drivers too. I'll look into it: there might need to be two versions of the kernel.
You can still use the new drivers but the old bin files, it should work.

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
October 03, 2014, 11:45:47 AM
 #80

@st.neman
Tried installing catalyst 14.9 and it totally borked hashrate using super kernel. 
Stick with catalyst 14.7 RC3 Smiley

I've experienced the hashrate drop on the new drivers too. I'll look into it: there might need to be two versions of the kernel.
You can still use the new drivers but the old bin files, it should work.
Didn't try old bin, but I reverted to 14.7RC3 (used DDU) and no problem Smiley
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
October 03, 2014, 11:49:50 AM
 #81

I suggest everybody keep "good binaries": make a backup of the fastest .bin files you have, so you can recover them in case of driver problems.
This will also let you get 1 or 2 percent more hashrate because of compiler variance (try removing the bin and running 3 or 4 times to see the variance in action).
Or use the provided bin file (see the OP), which should be a good one.
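
Keeping "good binaries" can be as simple as copying them aside once you have a fast one. A minimal Python sketch; `miner_dir` and `backup_dir` are hypothetical paths, not anything the miner itself defines:

```python
import shutil
from pathlib import Path


def backup_bins(miner_dir, backup_dir):
    """Copy every *.bin from the miner folder into a backup folder.

    Both paths are assumptions: pick whatever folders suit your setup.
    Returns the names of the files that were copied.
    """
    dest = Path(backup_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for bin_file in sorted(Path(miner_dir).glob("*.bin")):
        shutil.copy2(bin_file, dest / bin_file.name)  # copy2 keeps timestamps
        copied.append(bin_file.name)
    return copied
```

Using one backup folder per driver version and clock setting makes it easier to tell later which bin was the fast one.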

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
October 03, 2014, 11:54:53 AM
 #82

I suggest everybody keep "good binaries": make a backup of the fastest .bin files you have, so you can recover them in case of driver problems.
This will also let you get 1 or 2 percent more hashrate because of compiler variance (try removing the bin and running 3 or 4 times to see the variance in action).
Or use the provided bin file (see the OP), which should be a good one.

I've added a "final advices" section to the second post with the quoted text and this one:
"I've experienced lower power usage with catalyst 14.9 compared to 14.6 beta (a bit less compared to 13 but still better). Speaking of optimization, this should be kept in mind: buy a power meter for your miner(s)!"

zgor
Newbie
*
Offline Offline

Activity: 6


View Profile
October 03, 2014, 12:18:12 PM
 #83

Hi,

4gpu 280x = 69Mhs

This is for DMD, right? Could you share your full set of parameters passed to the sgminer?

On my 280x I can't get past 7.45 Mh/s with sph-sgminer version 4.1.0-105-gf02f. The hash rate increase due to this optimized kernel has been only about 7%.

Thanks.
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
October 03, 2014, 12:20:08 PM
 #84

Hi,

4gpu 280x = 69Mhs

This is for DMD, right? Could you share your full set of parameters passed to the sgminer?

Me, on 280x I can't get past 7.45 Mhs, with the sph-sgminer version 4.1.0-105-gf02f. The hash rate increase due this optimized kernel has been about 7% only.

Thanks.
That is right in the MHs range I was getting with catalyst 14.9 driver ... use 14.7 RC3 or the bin provided by pallas and see what you get.

Here's my bin for 280x at 1170/150 getting 18.46MHs
 https://mega.co.nz/#!pNcTnCLA!dszQHHMsK9RQngPQtQKsvKvvGz8KNqhPSO8HWwZNxD4
zgor
Newbie
*
Offline Offline

Activity: 6


View Profile
October 03, 2014, 12:36:11 PM
 #85

Hi,

4gpu 280x = 69Mhs

This is for DMD, right? Could you share your full set of parameters passed to the sgminer?

Me, on 280x I can't get past 7.45 Mhs, with the sph-sgminer version 4.1.0-105-gf02f. The hash rate increase due this optimized kernel has been about 7% only.

Thanks.
That is right in the MHs range I was getting with catalyst 14.9 driver ... use 14.7 RC3 or the bin provided by pallas and see what you get

Could not locate 14.7RC3 (I am on Linux), only 14.4, will try. The bin provided by pallas is for Hawaii (290), not for Tahiti (280), correct?
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
October 03, 2014, 12:41:42 PM
 #86

Hi,

4gpu 280x = 69Mhs

This is for DMD, right? Could you share your full set of parameters passed to the sgminer?

Me, on 280x I can't get past 7.45 Mhs, with the sph-sgminer version 4.1.0-105-gf02f. The hash rate increase due this optimized kernel has been about 7% only.

Thanks.
That is right in the MHs range I was getting with catalyst 14.9 driver ... use 14.7 RC3 or the bin provided by pallas and see what you get

Could not locate 14.7RC3 (I am on Linux), only 14.4, will try. The bin provided by pallas is for Hawaii (290), not for Tahiti (280), correct?
Check my previous post, I added a link to a tahiti bin.
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
October 03, 2014, 12:48:29 PM
 #87

Hi,

4gpu 280x = 69Mhs

This is for DMD, right? Could you share your full set of parameters passed to the sgminer?

Me, on 280x I can't get past 7.45 Mhs, with the sph-sgminer version 4.1.0-105-gf02f. The hash rate increase due this optimized kernel has been about 7% only.

Thanks.
That is right in the MHs range I was getting with catalyst 14.9 driver ... use 14.7 RC3 or the bin provided by pallas and see what you get

Could not locate 14.7RC3 (I am on Linux), only 14.4, will try. The bin provided by pallas is for Hawaii (290), not for Tahiti (280), correct?

this is 14.9 (non-beta) for linux:

http://support.amd.com/en-us/download/desktop?os=Linux+x86

the hawaii binary should work on tahiti as well.

Feneusens
Sr. Member
****
Offline Offline

Activity: 277


View Profile
October 03, 2014, 01:24:50 PM
 #88

Hi,

4gpu 280x = 69Mhs

This is for DMD, right? Could you share your full set of parameters passed to the sgminer?

Me, on 280x I can't get past 7.45 Mhs, with the sph-sgminer version 4.1.0-105-gf02f. The hash rate increase due this optimized kernel has been about 7% only.

Thanks.
That is right in the MHs range I was getting with catalyst 14.9 driver ... use 14.7 RC3 or the bin provided by pallas and see what you get

Could not locate 14.7RC3 (I am on Linux), only 14.4, will try. The bin provided by pallas is for Hawaii (290), not for Tahiti (280), correct?

this is 14.9 (non-beta) for linux:

http://support.amd.com/en-us/download/desktop?os=Linux+x86

the hawaii binary should work on tahiti as well.


Any benefit for upgrading to 14.9?

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
October 03, 2014, 01:53:26 PM
 #89

Hi,

4gpu 280x = 69Mhs

This is for DMD, right? Could you share your full set of parameters passed to the sgminer?

Me, on 280x I can't get past 7.45 Mhs, with the sph-sgminer version 4.1.0-105-gf02f. The hash rate increase due this optimized kernel has been about 7% only.

Thanks.
That is right in the MHs range I was getting with catalyst 14.9 driver ... use 14.7 RC3 or the bin provided by pallas and see what you get

Could not locate 14.7RC3 (I am on Linux), only 14.4, will try. The bin provided by pallas is for Hawaii (290), not for Tahiti (280), correct?

this is 14.9 (non-beta) for linux:

http://support.amd.com/en-us/download/desktop?os=Linux+x86

the hawaii binary should work on tahiti as well.


Any benefit for upgrading to 14.9?

using the old binary, made with 14.6, I noticed less power usage with 14.9.

DMDCreeper
Full Member
***
Offline Offline

Activity: 139


View Profile
October 03, 2014, 03:48:43 PM
 #90

I am getting better hash rates with this than with grs-sgminer (solo mining).  

R290  - 20.8 mh/s vs 18.3 mh/s with grs
R280x - 17.3 mh/s vs 14.6 mh/s with grs

I am not seeing much advantage to increase intensity above 20.

I am seeing that both cards are running hotter - using more power.  I have the 14.6 beta driver package installed.

However, I have two questions about what I am seeing.

1.  My R290s are set to 1100 for GPU speed, yet they are actually running under 1000.  I am seeing speeds of only 937.  auto gpu is disabled.  I increased powertune to 20 and the gpu speed increased slightly but not to the speed setting of 1100. Can anyone explain this?  

2.  I set the memory speeds on the R290s to 150.  With grs-sgminer I don't have to, they run at that speed simply because the memory is not used.  With this software, the memory on the R290s runs at 1250 and I cannot change that.  On the R280X, the memory is running at 1500 and I cannot lower it no matter what I set it to.  Can anyone explain this?

BTW, I was using the 13.12 driver package before and I could not get the hash rates that I am getting with the newer drivers.  The best I could get out of the R290s was about 13.8 mh/s.  With grs-sgminer and the 13.12 drivers I was getting 18.3 mh/s.
zgor
Newbie
*
Offline Offline

Activity: 6


View Profile
October 03, 2014, 04:13:54 PM
 #91

Here's my bin for 280x at 1170/150 getting 18.46MHs
 https://mega.co.nz/#!pNcTnCLA!dszQHHMsK9RQngPQtQKsvKvvGz8KNqhPSO8HWwZNxD4

Thanks! It's running stable, yielding 17.5 MHs for my 280x at 1129/1000 (could not lower the ram speed below that) and the 14.4 Catalyst drivers.
zgor
Newbie
*
Offline Offline

Activity: 6


View Profile
October 03, 2014, 04:27:49 PM
 #92

this is 14.9 (non-beta) for linux:

Plan on trying it out.


the hawaii binary should work on tahiti as well.

Indeed it runs, but gives no MH increase :-(   Is it a 64-bit vs 32-bit difference from the one provided by utahjohn?
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
October 03, 2014, 04:39:17 PM
 #93

this is 14.9 (non-beta) for linux:

Plan on trying it out.


the hawaii binary should work on tahiti as well.

Indeed it runs, but gives no MH increase :-(   Is it 64bits vs 32bits of the one provided by utahjohn?
Mine is 64-bit, Windows, built with 14.7 RC3 and compiled specifically for the 280x.

Let me know if 14.9 with compiled (14.7) bin helps power consumption if you test it.

Also see https://bitcointalk.org/index.php?topic=779598.msg9043545#msg9043545 about forcing lower mem clock by vbios mod Smiley
Always keep a backup copy of original vbios for each specific card if you need to reflash back to normal for resale ...
If you can't get linux tools to mod the vbios, PM me for options.
Also note that vbios is card mfr specific, so I can't just give a .ROM for Powercolor card and have it work on a different card ...
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
October 03, 2014, 05:11:52 PM
 #94

I am getting better hash rates with this than with grs-sgminer (solo mining).  

R290  - 20.8 mh/s vs 18.3 mh/s with grs
R280x - 17.3 mh/s vs 14.6 mh/s with grs

I am not seeing much advantage to increase intensity above 20.

I am seeing that both cards are running hotter - using more power.  I have the 14.6 beta driver package installed.

However, I have two questions about what I am seeing.

1.  My R290s are set to 1100 for GPU speed, yet they are actually running under 1000.  I am seeing speeds of only 937.  auto gpu is disabled.  I increased powertune to 20 and the gpu speed increased slightly but not to the speed setting of 1100. Can anyone explain this?  

2.  I set the memory speeds on the R290s to 150.  With grs-sgminer I don't have to, they run at that speed simply because the memory is not used.  With this software, the memory on the R290s runs are 1250 and I cannot change that.  On the R280X, the memory is running at 1500 and I cannot lower it no matter what I set it to.  Can anyone explain this?

BTW, I was using the 13.12 driver package before and I could not get the hash rates that I am getting with the newer drivers.  The best I could get out of the R290s was about 13.8 mh/s.  With grs-sgminer and the 13.12 drivers I was getting 18.3 mh/s.

1. High ram clock = lower Max core clock
2. If on Linux try:

http://epixoip.github.io/od6config/

zgor
Newbie
*
Offline Offline

Activity: 6


View Profile
October 03, 2014, 07:38:09 PM
 #95

Let me know if 14.9 with compiled (14.7) bin helps power consumption if you test it.

With your .bin and 14.9, the power consumption appears very slightly higher, but within 1%, and the hashrate is up about 0.6%.
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
October 03, 2014, 07:47:52 PM
 #96

Let me know if 14.9 with compiled (14.7) bin helps power consumption if you test it.

With your .bin, and 14.9, the power consumption appear very slightly higher, but within 1%, and the hashrate is up about 0.6%.
I'll stick with 14.7 then (I run other algos too and need to compile bins).
DMDCreeper
Full Member
***
Offline Offline

Activity: 139


View Profile
October 03, 2014, 09:30:32 PM
 #97

I am getting better hash rates with this than with grs-sgminer (solo mining).  

R290  - 20.8 mh/s vs 18.3 mh/s with grs
R280x - 17.3 mh/s vs 14.6 mh/s with grs

I am not seeing much advantage to increase intensity above 20.

I am seeing that both cards are running hotter - using more power.  I have the 14.6 beta driver package installed.

However, I have two questions about what I am seeing.

1.  My R290s are set to 1100 for GPU speed, yet they are actually running under 1000.  I am seeing speeds of only 937.  auto gpu is disabled.  I increased powertune to 20 and the gpu speed increased slightly but not to the speed setting of 1100. Can anyone explain this?  

2.  I set the memory speeds on the R290s to 150.  With grs-sgminer I don't have to, they run at that speed simply because the memory is not used.  With this software, the memory on the R290s runs are 1250 and I cannot change that.  On the R280X, the memory is running at 1500 and I cannot lower it no matter what I set it to.  Can anyone explain this?

BTW, I was using the 13.12 driver package before and I could not get the hash rates that I am getting with the newer drivers.  The best I could get out of the R290s was about 13.8 mh/s.  With grs-sgminer and the 13.12 drivers I was getting 18.3 mh/s.

1. High ram clock = lower Max core clock
2. If on Linux try:

http://epixoip.github.io/od6config/

I should have noted I am running Windows 7 64 bit.  I am trying to lower the RAM speed but it is not budging no matter what settings I try.  Likewise the GPU core clock is staying under 1000.

I do not have this problem with grs-sgminer miner. 

Even so, this package is giving me a higher hash rate than grs-sgminer so I will stay with it.
anatolikostis
Legendary
*
Offline Offline

Activity: 1946



View Profile
October 03, 2014, 10:01:35 PM
 #98

Your GPU frequency fluctuation is caused by the driver's power limit function.
Just try reducing the GPU voltage, maybe 2-3 steps down...
optiplex
Newbie
*
Offline Offline

Activity: 17


View Profile WWW
October 05, 2014, 05:05:20 AM
 #99

That sounds like a total line of crap to me.
qaz6767
Full Member
***
Offline Offline

Activity: 138


View Profile
October 05, 2014, 07:04:13 AM
 #100

Help set up R9 290 Tri-x! Thank you!
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
October 05, 2014, 07:11:56 AM
 #101

Read page 1, then ask a question if help is needed.  (Sorry for being ill-tempered, I just had to block a credit card due to fraudulent charges being made on it, cash only till the new card arrives.)
qaz6767
Full Member
***
Offline Offline

Activity: 138


View Profile
October 05, 2014, 09:02:58 AM
 #102

I cannot replace the bin file. After restarting the miner, the old bin file appears again. How do I replace it? Thank you
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
October 05, 2014, 11:05:43 AM
 #103

If a bin file with the same name already exists, the miner shouldn't replace it.
So the best approach is to replace the bin file the miner creates with mine, using the same filename.
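
A minimal sketch of that swap, assuming hypothetical file paths (the actual name of the generated .bin depends on the miner build, kernel, card and worksize, and is not given in this thread). The function name is made up for illustration. It backs up whatever the miner generated, then copies the provided binary over it under the same name:

```python
import shutil
from pathlib import Path

def install_prebuilt_bin(generated_bin, provided_bin):
    """Overwrite the miner-generated .bin with the provided one, keeping its name.

    Both paths are supplied by the caller; nothing here knows the miner's
    real naming scheme. Returns the path of the backup copy.
    """
    generated_bin = Path(generated_bin)
    provided_bin = Path(provided_bin)
    if not generated_bin.exists():
        # The miner has to run once first so the target filename is known.
        raise FileNotFoundError("run the miner once so it creates its .bin first")
    backup = generated_bin.with_name(generated_bin.name + ".bak")
    shutil.copy2(generated_bin, backup)         # keep the old binary around
    shutil.copy2(provided_bin, generated_bin)   # same filename, new contents
    return backup
```

Stop the miner before running something like this and start it again afterwards, so that it picks up the replaced file instead of recompiling.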

qaz6767
Full Member
***
Offline Offline

Activity: 138


View Profile
October 05, 2014, 01:58:08 PM
 #104

Thanks!!!
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
October 07, 2014, 08:02:36 AM
 #105

It looks like the compiler included in the 14.9 drivers produces binaries which run considerably slower than those from older releases.
The same happens for other algorithms as well, for example X11.
I've tweaked the code a bit but I still can't reach full speed, so I will keep on trying or, if necessary, wait for a new driver release.
Meanwhile, if you are on 14.9, use the provided bin file instead of the kernel source in .cl format.

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
October 09, 2014, 10:01:33 PM
 #106

Any chance of getting a worksize 128 super optimized kernel to try on HD5450? (256 too large)
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
October 10, 2014, 07:40:39 AM
 #107

The changes needed to make it work at 128 are easy, but it probably won't be tuned well for such a card: I've tested on r9 290 and 7950 while developing. It might not even work at all.
If you want to try I can send you a file or the changes and if it works well we can post it here.

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
October 11, 2014, 01:24:53 AM
 #108

Been busy working with miningfield setting up my USA mirror of pools.
Yeah, if you can post a modified .cl for ws 128 I'll test it on the 5450.

Made a lot of progress on the USA mirror setup today, might have it online by Monday.
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
October 11, 2014, 08:58:16 AM
 #109

just change the initial part of the main function to:

for (u = get_local_id(0); u < 256; u += get_local_size(0)) {
  T2[u] = ROTL64(T0[u], 16UL);
  T3[u] = ROTL64(T0[u], 24UL);
  T4[u] = ROTL64(T0[u], 32UL);
  T5[u] = ROTL64(T0[u], 40UL);
  T6[u] = ROTL64(T0[u], 48UL);
  T7[u] = ROTL64(T0[u], 56UL);
}

this part was blocking worksize < 256.
as I said previously, it still might not work or be very slow for tuning reasons.
let me know if it works.
thanks!
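
Why the strided loop removes the worksize restriction can be checked with a small host-side emulation. This is only a sketch under assumptions: the tables are taken to be 256-entry arrays indexed by u, the table contents below are stand-ins (the real T0 constants are not reproduced), and `fill_tables` is a made-up helper, not part of the kernel. Each emulated work-item starts at its local id and steps by the local size, so together the work-items touch every entry whether the worksize is 64, 128 or 256:

```python
MASK64 = (1 << 64) - 1

def rotl64(x, n):
    """Rotate a 64-bit value left by n bits (0 < n < 64)."""
    return ((x << n) | (x >> (64 - n))) & MASK64

def fill_tables(t0, local_size):
    """Emulate every work-item of one workgroup running the strided loop."""
    rotations = (16, 24, 32, 40, 48, 56)            # T2 .. T7
    tables = {r: [None] * 256 for r in rotations}
    for local_id in range(local_size):              # one pass per work-item
        for u in range(local_id, 256, local_size):  # u += get_local_size(0)
            for r in rotations:
                tables[r][u] = rotl64(t0[u], r)
    return tables

# Stand-in table contents, NOT the real Groestl T0 constants.
t0 = [(i * 0x0101010101010101) & MASK64 for i in range(256)]

for ws in (64, 128, 256):
    tables = fill_tables(t0, ws)
    # Every entry of every table must have been written exactly as for ws=256.
    assert all(v is not None for tbl in tables.values() for v in tbl)
```

With a worksize of 256 each work-item writes one entry per table; with 128 each writes two. Whether the kernel is then fast on a small card is a separate question, as noted above.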

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
October 11, 2014, 09:05:31 AM
 #110

Thanks, will do as soon as I get time.  The VM server box has a 5450 in it, so I might as well let the host make use of it. I can get ~0.25 MH/s with the normal groestlcoin kernel with ws 128 on it... it's running 24/7/365 anyway.

It only has 80 shaders, LOL. It's a dwarf, but it is air cooled, hehe.
About on par with the Intel HD GPU (10 shaders) in the G3220 CPU as far as hashrate goes.

qaz6767
Full Member
***
Offline Offline

Activity: 138


View Profile
October 20, 2014, 12:26:52 PM
 #111

Help! What is the bat file to start Diamond? I cannot get it to run on a 280x card. Thanks
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
October 20, 2014, 12:29:45 PM
 #112

just curious... did you manage to make it work? if yes, what hashrate?

EDIT: it has about half the shaders of a nexus 9 :-D

Wolf0
Legendary
*
Online Online

Activity: 1666


Miner Developer


View Profile
November 04, 2014, 12:43:07 PM
 #113

A BIT OF HISTORY

The first gpu miner for groestlcoin and similar was sph-sgminer by phm. Optimizing the original implementation was trivial (almost 3x the speed could be achieved!), so there are probably tens of optimized versions around, many of which have been kept private: mining groestlcoin and similar was always unfair for most people, at least for non-devs.
Hopefully this kernel will end this and should also level the field between amd and nvidia.
I believe my version is faster than many of the other kernels because of the time I dedicated to it and the thousands of tests I did.

FINAL ADVICE

I suggest keeping "good binaries": make a backup of the fastest .bin files you have, so you can recover them in case of driver problems.
This will also enable you to get 1 or 2 percent more hashrate because of compiler variance (try removing the bin and running 3 or 4 times to see the variance in action).
Or use the provided bin file (see the OP) which should be a good one.

I've experienced lower power usage with catalyst 14.9 compared to 14.6 beta (a bit less compared to 13 but still better). Speaking of optimization, this should be kept in mind: buy a power meter for your miner(s)!
But it looks like the compiler included in 14.9 drivers produces binaries which run considerably slower than older releases. If you are on 14.9, use the provided bin file instead of the kernel source in .cl format.

If you want to fix it for 14.9, remove the naive implementation of the B64_# macros and use swizzle. Worked for me.
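
The B64_# macro definitions are not shown in this thread, so the following is only a sketch under assumptions: B64_n(x) is taken to select byte n of a 64-bit value (0 = least significant), and "swizzle" is taken to mean reinterpreting the value as a byte vector (in OpenCL C, something like as_uchar8(x) followed by a component select) instead of shifting and masking. On a little-endian layout the two forms pick the same byte, which this Python check illustrates:

```python
import struct

def b64_shift(x, n):
    """Shift-and-mask form: byte n (0 = least significant) of a 64-bit value."""
    return (x >> (8 * n)) & 0xFF

def b64_swizzle(x, n):
    """Byte-vector form: lay out the 8 bytes little-endian and pick lane n."""
    return struct.pack("<Q", x)[n]

x = 0x0123456789ABCDEF
# Both forms must agree for all eight byte positions.
assert [b64_shift(x, n) for n in range(8)] == [b64_swizzle(x, n) for n in range(8)]
```

The results are identical either way; the difference is only in what code the compiler generates for each form, which is what the speed gap between driver versions is about.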

Code:
Donations: BTC: 1WoLFdwcfNEg64fTYsX1P25KUzzSjtEZC -- XMR: 45SLUTzk7UXYHmzJ7bFN6FPfzTusdUVAZjPRgmEDw7G3SeimWM2kCdnDQXwDBYGUWaBtZNgjYtEYA22aMQT4t8KfU3vHLHG
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
November 04, 2014, 12:50:23 PM
 #114

Thanks, but I've already tried every combination of bitwise operations and vectors (as_uchar...): I could make it work, but the hashrate is about 20 Mh/s vs 25 Mh/s on 14.6 beta.

Wolf0
Legendary
*
Online Online

Activity: 1666


Miner Developer


View Profile
November 04, 2014, 01:09:20 PM
 #115

Ah, I see - I just saw it go from 7MH/s to... I think 20, on 14.9, so I figured it worked; never mind, then.

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
November 04, 2014, 01:27:57 PM
 #116

It's funny how some little changes lead to huge hashrate drops (depending on compiler version); but it's true for memory intensive algos only, as far as I can see.
Maybe your own version doesn't have this problem, then ;-)

Wolf0
Legendary
*
Online Online

Activity: 1666


Miner Developer


View Profile
November 04, 2014, 01:43:39 PM
 #117

Not true for only memory intensive algos - one little screwup and the idiot compiler will double the size of your code, it won't fit in the code cache, and be slow lol

cryptonit
Legendary
*
Offline Offline

Activity: 1344


CVO Diamond Foundation (Visionary)


View Profile WWW
November 16, 2014, 07:31:32 AM
 #118


cryptonit
Legendary
*
Offline Offline

Activity: 1344


CVO Diamond Foundation (Visionary)


View Profile WWW
November 23, 2014, 12:23:11 PM
 #119

@pallas could you find the current state-of-the-art mining software for DMD Groestl and post links in the DMD ANN? We will then update the software on the website.

It would be great if it already included your performance boost tricks.

I think no one from our core team runs AMD cards any longer, so your help would be welcome.

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
November 23, 2014, 02:39:19 PM
 #120

the problem with my kernel is that, no matter how hard I try, I can't get the best hashrate on 14.9 drivers (only 20 Mh/s vs 25 with 14.6), so it's not enough to just replace diamond.cl on sgminer 4.1 or 5.
that's why I still prefer people visit this post, with all the info and troubleshooting, for best performance.
the only way to make it clean is creating a fork of sgminer, for tahiti and hawaii cards only, with the precompiled binary; some changes are needed in order for it to always use the binary and not compile the cl sources.
not sure I like it but it might work for many... what do you think?

Wolf0
Legendary
*
Online Online

Activity: 1666


Miner Developer


View Profile
November 23, 2014, 03:06:50 PM
 #121

Just an aside - I've gotten the same results - 21MH/s vs. 25MH/s. It's frustrating - but all I've tried is the lookup table implementation, so far.

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
November 23, 2014, 04:15:49 PM
 #122

Well, that means there is probably little room for improvements on that kind of implementation.
I'm curious to see if a bitslice version can be faster on AMD gpus, but I have no time (and no interest because of negative revenue) to try it myself.

Wolf0
Legendary
*
Online Online

Activity: 1666


Miner Developer


View Profile
November 23, 2014, 04:18:11 PM
 #123

I think it might be - 14.9 killed my X11 hashrate at first, down from 10MH/s on 290X to 2 point something. After redesigning Groestl, still based on lookup tables, I got it back up to 6.5MH/s or so. Still dismal...

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
December 27, 2014, 10:18:22 PM
 #124

Could someone please share their hashrate with r9 285? I'm curious to see if it outperforms the 280 and how much power it uses.

lpedretti
Full Member
***
Offline Offline

Activity: 147


View Profile
December 29, 2014, 04:19:19 PM
 #125

I was having issues using the optimized cl and precompiled binaries: no HW errors, but shares were only very occasional and pools reported a very low hashrate. However, the problem was the sgminer version I was using. I'm now using sgminer-develop, which has the neoscrypt optimized kernels, and with that version it works like a charm!
Running Lubuntu 14.04 with 14.x (don't remember which one)
Clock at 930, 0.95v, 13.5 Mh/s each XFX-7970DD and Gigabyte 280x windforce

Great job!

Best regards!

AC: ANuRoFPkCjZSxsw2S41djrrA1D4xMMmwhs
realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
December 31, 2014, 10:59:08 PM
 #126

Hi All,

I registered here because I need a little help from those of you who develop this OpenCL kernel.
A month ago I found the Groestl algo on the AMD dev forums, thanks to Wolf0 who mentioned it there. I thought it would be a good algo to test my skills in GCN asm, and I'd like to play with it; maybe I can optimize it better than the OCL compiler (or maybe not, but at least I can learn from it anyway).

So the help I'm seeking is this:
- Please send me the latest version of this kernel (I see everyone altering it a bit, just don't know which is which)
- And pls give me a test vector with these things:
  - global kernel dimensions, workgroup size(I guess it's 256)
  - kernel parameters: dump "char *block", and the "target" value
- And of course the above testcase must find a GroestlCoin hash.

Thank you in advance

(I already sent it to Wolf0 on the amd dev forums, but the moderation there can take more time there and later I found this more appropriate place for my question)

And have a Happy New Year, btw
Wolf0
Legendary
*
Online Online

Activity: 1666


Miner Developer


View Profile
January 01, 2015, 12:17:47 AM
 #127

I don't check there often - how exactly do you do GCN ASM? I'm interested.

realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 01, 2015, 09:22:12 AM
 #128

how exactly do you do GCN ASM? I'm interested.

I wrote an assembler for it. You can try it at realhet.wordpress.com. (Use Cat 13.4 or older, otherwise examples will crash.)

My first thoughts after compiling the OCL kernel (on a 7770):
- It's 2.5 times bigger than the instruction cache (and there are no loops in it, so I guess it often reads from RAM).
- T0 and T1 are located in the GPU RAM.
- VReg count is above 128 -> that allows only the minimum number of 4 wavefronts/CU, so there is no latency hiding via parallel wavefronts.
- Too short a kernel with too much initialization: ideally I'd let every workgroup run for a minimum of 0.5 sec, so kernel launch and LDS table initialization would take no time compared to the actual work.
- Better instructions: BitFieldExtract for 64-bit rotate, ds_read2_b64 for 128-bit LDS read.
- Balancing load between LDS and L1 cache.

I don't know which of the above is an actual bottleneck or will be useful, but I want to find out.
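
The register point in the list above can be put into numbers. This is a rough sketch, not taken from the thread: it assumes the commonly documented GCN limits of 256 VGPRs per lane in each SIMD, at most 10 wavefronts per SIMD, 4 SIMDs per CU, and VGPRs allocated in blocks of 4. The function names are made up for illustration.

```python
def waves_per_simd(vgprs, regs_per_lane=256, max_waves=10, granule=4):
    """Wavefronts one SIMD can hold given the kernel's VGPR count (vgprs >= 1)."""
    allocated = -(-vgprs // granule) * granule   # round up to the allocation block
    return min(max_waves, regs_per_lane // allocated)

def waves_per_cu(vgprs, simds_per_cu=4):
    """Wavefronts resident on a whole compute unit."""
    return waves_per_simd(vgprs) * simds_per_cu

for vgprs in (129, 128, 120, 84):
    print(vgprs, "VGPRs ->", waves_per_simd(vgprs), "per SIMD,",
          waves_per_cu(vgprs), "per CU")
```

Under these assumptions, anything above 128 VGPRs leaves a single wavefront per SIMD (4 per CU), so a SIMD has nothing else to run while that wavefront waits on memory; at 128 or below a second wavefront fits.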
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 01, 2015, 11:04:15 PM
 #129

I'm going to try your assembler, very interesting projects!
About your observations, first of all keep in mind that the compiler is pretty unpredictable: many optimizations just do not make sense but they work. Also I only tested it with Tahiti and Hawaii cards.
Kernel size: it can easily be made smaller (for example by including a single table instead of 2), but in all my tests it doesn't bring any advantage.
T0 and T1 are not in gpu ram: it would be much slower if they were. They are in constant ram, I believe.
Short kernel: even though you might design it to process multiple hashes in a single run, I think it's not worth it. Simple proof: algos which are tens of times faster than groestl, like keccak, still do a single hash per kernel run. Another reason is that making the kernel last longer will result in more rejected shares.
Balancing load between local ram and cache (or any balancing of memory reads): I believe that many optimizations that do not make sense work because they introduce little delays that allow the memory reads of the threads to interleave better. They sort of fit together better. In fact, modifying other parts of the code may make the same optimization worthless. Interesting speed variations may be brought by switching instructions or grouping local ram reads differently, for example.

Hope that helps.

Atomicat
Hero Member
*****
Offline Offline

Activity: 840



View Profile
January 02, 2015, 04:10:02 AM
 #130

Learn something new every day.  My instinct is to push it till it moves, crank it to 11, but that doesn't work with the R9-290.  Doesn't work because it's throttling for power considerations long before you're hitting 1150.  Just dropped my voltages right down and finally got 23.5 at 1125, I-20.  New understanding of how to handle this card will make for better benchmarks, for true.

Oh, nice price jump today, from 60k to 70k.  Yeah, I'll take credit for that.  Put some orders up last night, woke up to find that I basically owned it on Cryptsy!  Drop a line with your DMD wallet address, I'll give you well earned reward from my ill gotten gains.


realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 02, 2015, 10:05:10 AM
 #131

"T0 and T1 are not in gpu ram: it would be much slower if they were."

Thanks for the ideas!

Actually I knew it from the disasm, that it uses ram instead of LDS for T0, T1. (Note that there is no such thing as constant memory in GCN. It can read a single value with the Scalar ALU and broadcast it across all the wavefront's workitems or it can read 64 values for a whole wavefront by the Vector ALU. Because T0 is addressed by data, it must be read by the VALU using L1 cache (there is a scalar cache too)).
And from there I had the idea of balancing the two sources (LDS and L1).

I did a simple test: renamed T0 and T1, allocated a new T0 and T1 in __local, and then initialized them properly. Result: all tbuffer memory read instructions disappeared from the disasm, and the hash rate dropped from 3.99 MH/s to 3.841 MH/s. I don't know how big the penalty of copying T0 and T1 into the LDS is, though.
By the 'textbook': L1 cache can read 4bytes/cycle, LDS: 8bytes/cycle

And yes, the OpenCL compiler is totally unpredictable.

Important question: In the MH/s calculation 1 kernel thread execution means 2 Hashes, right?

(I have a HD7770 @1000MHz, and it's at 4MH/s which looks similar to Wolf0's report on dev.amd.com: R9 290 @1200 20MH/s. Using 14.9 where the compiler generates slower code.)

Now I have to convert all the math into asm. That's painful.
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
January 03, 2015, 03:22:57 AM
 #132

@realhet
please share your work with the rest of us if the assembly optimization works out.  I looked at an R9 285 review today; it looks promising as long as the smaller memory bus (256 vs 384) doesn't bottleneck.  It should be faster than the 280x and on par with or maybe even better than the 290, with a lower power requirement...
May need tweaks for each architecture ... can it be written to detect which card it's running on and auto select best?

A quote from AnandTech review :
Quote
A complete Tonga configuration will contain 2048 SPs, just like its Tahiti predecessor, with 1792 of those SPs active on R9 285. This is paired with the card’s 32 ROPs attached to a 256-bit memory bus, and a 4-wide (4 geometry processor) frontend. Compared to Tahiti the most visible change is the memory bus size, which has gone from 384-bit to 256-bit. In our look at GCN 1.2 we’ll see why AMD is able to get away with this – the short answer is compression – but it’s notable since at an architectural level Tahiti had to use a memory crossbar between the ROPs and memory bus due to their mismatched size (each block of 4 ROPs wants to be paired with a 32bit memory channel). The crossbar on Tahiti exposes the cards to more memory bandwidth, but it also introduces some inefficiencies of its own that make the subject a tradeoff.

Meanwhile Tonga’s geometry frontend has received an upgrade similar to Hawaii’s, expanding the number of geometry units (and number of polygons per clock) from 2 to 4. And there are actually some additional architectural efficiency improvements in here that should further push performance per clock beyond what Hawaii can do in the real world.
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 03, 2015, 01:26:36 PM
 #133

Yes, they are two chained iterations of groestl.
But they run slightly different code: the first is optimised because part of the input is known in advance, and the second because the whole hash is not needed.

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 03, 2015, 01:28:54 PM
 #134

Is anyone willing to donate or lend a 285 so I can optimise for Tonga?

realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 03, 2015, 07:34:53 PM
 #135

Hi again,

Finally, it has produced a correct result for the first time.
The speed test was surprisingly good: HD7770 at 1000 MHz (640 streams, GCN 1.0 chip), Catalyst 14.9 (the 20% slower driver), total work items: 256*10*512, elapsed: 558.613 ms, 4.693 MH/s, gain: 1.17x, where the baseline is the OpenCL implementation (found on amd.com in Wolf0's post), which runs at 4.00 MH/s.

And the first optimization was really a cheap shot ;D. Unlike the OpenCL version, I was able to get it under 128 VGPRs (I use 120 currently; it was kinda close). Since each vector ALU can now choose from 2 wavefronts at any time, latency hiding finally kicked in -> elapsed: 279.916 ms, 9.365 MH/s, gain: 2.34x
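
The 128-VGPR threshold mentioned above can be sketched as a small occupancy calculation. This is only an illustration under assumed GCN limits that the post does not state: 256 VGPRs per lane per SIMD, at most 10 resident wavefronts per SIMD, and registers allocated in blocks of 4. The function name and the 132-register example are hypothetical.

```python
# Sketch: how many wavefronts one GCN SIMD can keep resident for a given VGPR count.
# Assumed limits (not taken from the thread): 256 VGPRs per lane per SIMD,
# at most 10 wavefronts per SIMD, allocation granularity of 4 registers.
VGPR_BUDGET = 256
MAX_WAVES = 10
GRANULARITY = 4


def waves_per_simd(vgprs: int) -> int:
    """Return the number of wavefronts a SIMD can hold at this VGPR usage."""
    if vgprs <= 0:
        raise ValueError("vgprs must be positive")
    # Round the request up to the next multiple of the allocation granularity.
    allocated = -(-vgprs // GRANULARITY) * GRANULARITY
    if allocated > VGPR_BUDGET:
        return 0  # does not fit at all; a real compiler would spill to scratch
    return min(MAX_WAVES, VGPR_BUDGET // allocated)


if __name__ == "__main__":
    # 120 is the count from the post; 132 is a made-up value just above the threshold.
    for vgprs in (120, 128, 132):
        print(vgprs, "VGPRs ->", waves_per_simd(vgprs), "wavefront(s) per SIMD")
```

Under these assumptions, anything up to 128 VGPRs leaves room for two wavefronts per SIMD, which would be the point where the scheduler gets a second wavefront to switch to while the first one waits.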

And I'm full of ideas to try :D Next will be shrinking the code to fit into the 32 KB instruction cache. Right now it is 300 KB; it's a massive macro unroll at the moment. The original pallas OpenCL version is 110 KB, so I wonder why mine is 3x bigger. Anyway, on GCN we can have loops with only 1 cycle of overhead, or I can even write subroutines with call/ret instructions, so I've got to try how fast it is when the instruction cache has no misses at all.

OpenCL thing: while simplifying the code (I dropped the first/last round optimizations because they would be hard to implement in asm atm), I noticed something I already knew from the past: the OpenCL -> LLVM -> amd_il -> gcn_asm toolchain eliminates all constant calculations and all calculations whose results are not used at all. I watched the timings while making these modifications and it stayed around 4 MH/s. Sometimes it dropped below 3.7 when I put measurement code at various places to compare the original kernel with my kernel: if(gid==1234 && flag==1) for(int i=0; i<16; ++i) output[i] = g[i];
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 03, 2015, 10:21:32 PM
 #136

Great progress, very interesting!
The first improvement, 1.17x, is about the same as the 20% that is lost on 14.9 compared to 14.6 beta, so the two implementations are equivalent.
The second, 2.34x, is really impressive: I have tried multiple times to reduce the number of variables as much as possible (down to 3x16 ulong arrays, 2 ulong and 2 uint), but the results were always worse, so probably that improvement can't be implemented in opencl, or at least I don't know how to.
The same for code size and instruction cache: I was able to squeeze it to about 50K, but at a speed loss.
About the compiler being able to eliminate the constant calculations: I noticed that, but doing it by hand works best both in terms of speed and kernel size.
Finally, a question about your work: do you plan to open-source it?

realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 04, 2015, 11:26:22 PM
 #137

Hi,

The Groestl asm code is open source (I just uploaded it). My compiler and IDE are closed source, though; but once you have compiled the kernel with them into an .ELF binary, you can use it on Linux too, not just Windows.

The first asm version is documented on my blog. Check it out here -> http://realhet.wordpress.com/
It's only a development version, and the kernel parameters are incompatible with pallas's OpenCL kernel. I have a hard time reverse engineering how params are passed through registers, not to mention that it can be different in every Catalyst version, so I keep the parameters simple. One buffer in pinned memory for all data I/O is the fastest anyway.
I'm planning to post about many optimizations. Let's see how far I can go. Staying under 128 VGPRs, it is already at a 2.3x speedup, and I'm expecting more. ;D
I believe that OCL is so generalized and so far from the actual GCN hardware that for some projects it is worth going low level. (Not all projects: for example, I failed with Litecoin. It's better for it to stay in maintainable OCL code.)
realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 05, 2015, 05:12:13 PM
 #138

First 2 optimizations are done, I wrote a blog post about them. I'm at 2.65x now.
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 05, 2015, 06:29:19 PM
 #139

Thanks very much!
Unfortunately it appears the two optimisations are hard to implement in OpenCL: the minimum code size I was able to achieve was 50K, far from 32K, and reducing the number of variables as much as possible didn't provide any speedup. Maybe the number of VGPRs is still higher than 128...

realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 06, 2015, 04:39:59 AM
 #140

Hi, I think I'm done with the things I wanted to try. It's at 3.48x now ;D Check the second part of the optimizations: http://realhet.wordpress.com/
It's really cool how the ALU, the LDS and the L1 cache can cooperate on the same job.

Let's discuss how my kernel can be used in the miner program. I'm an absolute noob with mining, so please help me. Is it the popular sgminer? Can I compile it with Qt 5.3 with MSVC? Or maybe under Visual Studio Express? Do you have actual test vectors to test it? I want to make sure it calculates 100% correctly. And I can't wait to see if it really reaches 70 MH/s on a 290x beast.
realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 06, 2015, 04:49:51 AM
 #141

(Oops, an important part was missing from my blog post -> now it's corrected.)
Wolf0
Legendary
*
Online Online

Activity: 1666


Miner Developer


View Profile
January 06, 2015, 06:12:39 AM
 #142

You first need to have the target passed to the kernel.

Code:
Donations: BTC: 1WoLFdwcfNEg64fTYsX1P25KUzzSjtEZC -- XMR: 45SLUTzk7UXYHmzJ7bFN6FPfzTusdUVAZjPRgmEDw7G3SeimWM2kCdnDQXwDBYGUWaBtZNgjYtEYA22aMQT4t8KfU3vHLHG
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 06, 2015, 11:35:27 AM
 #143

Thanks for the update.
I've been using Linux only for many years now, so I can't help you with compiling on Windows; just know that it's trivial to compile the miner on Linux, and it runs in a terminal so it doesn't need Qt.
About the software version, I prefer the good old sph-sgminer, which is based on sgminer 4.1 (I modified it a bit), but you can use the latest sgminer 5.x as well.
To test the kernel you can simply point it to a pool, printf the hash or whatever.
Back to my opencl effort, I've reduced the number of vgprs to 147 but I'm struggling to get past that.

sp_
Legendary
*
Offline Offline

Activity: 1134

Ccminer developer


View Profile
January 06, 2015, 01:06:50 PM
 #144

Does your assembler support self-modifying code? ;) Then you can use the instruction cache as a precalc buffer as well. The advantage is that most GPUs can read from the instruction cache in parallel with the level 1 cache.

BTC: 1CTiNJyoUmbdMRACtteRWXhGqtSETYd6Vd
qwep1
Hero Member
*****
Online Online

Activity: 490


View Profile
January 06, 2015, 03:38:19 PM
 #145

And will there be a version for Windows? :)
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 07, 2015, 10:34:17 PM
 #146

On Hawaii only, I've managed to get to 123 VGPRs and 28K ISA size, so now I have all the optimizations of the asm code :-)
I believe the asm version is still faster on hawaii, and of course much faster on smaller cards.

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
January 07, 2015, 11:37:09 PM
 #147

Is it a new optimized CL or a BIN? (I'll test on 280x and 7950.)
realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 08, 2015, 07:20:00 AM
 #148

"On hawaii only, I've managed to get to 123 VGRPS and 28K ISA size, so now I have all the optimizations of the asm code :-)"

Then it's got all the goodies: VGPRs, icache and 2 RAM + 6 LDS reads. The speedup must be the same 3.5x! Is it that much?

It should be good on small cards too; the only important difference is the number of CUs anyway.
realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 08, 2015, 07:36:35 AM
 #149

And you have the first/last round optimizations so it must be faster!
If it's as fast as the asm version, then I don't have to deal with the kernel parameters, which is boring/painful. My asm was only needed to encourage you to shrink the code/regs. :D

Can you share the new source?
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 08, 2015, 09:28:52 AM
 #150

Unfortunately it's only about 25% faster, but we should compare apples to apples: could you try your code on a Hawaii chip so we have a common testbed?
Now I'm working on further first-round optimizations; they bring little improvement but it's still worth it imho.

realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 08, 2015, 09:59:22 AM
 #151

25% seems like just getting back what is lost with 14.9. I really thought you had it 3.5x faster.

Are you sure that it uses only 123 VGPRs AND the code size is only 28 KB? Or has it started to use scratch registers (those are terribly slow)?

Unfortunately I can't try it on anything other than the HD7770. But I'd also like to see how it runs on faster systems. I uploaded it to the download area of my blog in case someone wishes to try it. I'm not familiar with the latest GCN chips (I think AMD only improves their instruction set from time to time, and maybe cuts down double-precision performance), but with this particular program I'm pretty sure that it will bring the 3.48x speedup on the R9 290x too, because all the CUs can work independently, each using its own LDS, L1 cache and ICache. So if the current OCL code on the R9 290x runs at 20 MH/s, then the latest asm code should run at 70 MH/s.
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 08, 2015, 10:03:12 AM
 #152

25% compared to 14.6, it's 43% compared to 14.9.
No scratch reg use (when I triggered it a couple times, it slowed down to less than 1 Mh/s).
I'd like to try your asm code myself, but I'd need the linux version of the assembler.

realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 08, 2015, 12:55:13 PM
 #153

Now I've managed to build sgminer 5.1 on my system. I still have to make my kernel work with it.

Does sgminer have an offline 'diagnostic' mode, just for testing whether the kernel runs and how fast it runs?

"I'd need the linux version of the assembler."
Sorry, that's impossible. It's not even written in C++, so it can't be compiled on any system other than Windows.

And to make things more complicated :D you have to compile with it for every type of GCN card, multiplied by every Catalyst driver version in which AMD's developers changed something. My compiler only patches the binary into the .elf; the actual .elf file is generated by the current Catalyst driver for the currently selected graphics card.
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 08, 2015, 01:39:04 PM
 #154

Does sgminer have an offline 'diagnostic' mode, just for testing whether the kernel runs and how fast it runs?

There is a simple "benchmark" option:

--benchmark         Run sgminer in benchmark mode - produces no shares

realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 08, 2015, 03:43:38 PM
 #155

Unfortunately there is no --benchmark parameter. I checked the source code too, but found nothing similar: https://github.com/sgminer-dev/sgminer/blob/master/sgminer.c
Is there a simple way to run it? I now have a groestl wallet, but where can I get a username from? What parameters should I use other than -k groestl and -d 1?
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 08, 2015, 03:57:32 PM
 #156

They probably removed it; I'm using an older version.
I run it like this, for solo mining:

sgminer -k groestlcoin --difficulty-multiplier 0.0039062500 -o http://localhost:GROESTLCOIN_RPC_PORT -u YOURUSER -p YOURPASSWORD

Then you have to find and add your best intensity and worksize (my open-source kernel works with 256 only).
The username and password are set in groestlcoin.conf; the port you can easily find in their thread (or via netstat).

realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 08, 2015, 08:08:22 PM
 #157

Thanks for the help. Now it runs, and I found that this command produces the best results:
sgminer -d 1 -k groestlcoin --difficulty-multiplier 0.0039062500 -o http://localhost:1441 -u u -p p --shaders 1280 --worksize 256 -g 1 --intensity 24

It produces 2 MH/s on average, which is half of the 4 MH/s I calculated earlier.
Does sgminer divide the number of Groestl hash calculations by 2? Actually, that would be more reasonable.
Or is something really wrong, so that it runs at half speed (exactly half speed)?
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
January 08, 2015, 08:30:59 PM
 #158

Are you sure it is running your kernel? Look in your sgminer dir for a .bin file generated by OCL; it may be running the default groestlcoin OCL kernel. Delete the .bin and replace it with your own generated file of the same name; it will not be regenerated if it exists in the dir. You must delete the .bin whenever you change configs to force an OCL recompile ... but you don't want that, you want to run your asm kernel ... so you will have to figure out the parameter passing from sgminer ...
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 08, 2015, 08:41:59 PM
 #159

Intensity 24 is too much, I'd stay between 20 and 22, otherwise you'll produce a lot of rejected shares (or orphans if solo mining).
The shaders option is ignored for groestl.
The hashrate should be calculated on the full computation, i.e. 2 chained hashes.
What kernel are you using?
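
The advice in this post can be folded into a small helper that assembles the command line. It is only a sketch: the flags, the multiplier and the localhost URL are copied from the commands posted in this thread, the helper name build_sgminer_args is hypothetical, and whether a particular sgminer build accepts every flag has not been checked here. The multiplier 0.0039062500 is exactly 1/256.

```python
import shlex


def build_sgminer_args(url, user, password, intensity=20, device=None):
    """Assemble an sgminer argument list following the advice in the thread.

    Hypothetical helper: flags are taken from the commands posted above,
    not from sgminer documentation.
    """
    # Recommended range from the thread; higher values reportedly cause rejects.
    if not 20 <= intensity <= 22:
        raise ValueError("intensity should stay between 20 and 22")
    args = ["sgminer"]
    if device is not None:
        args += ["-d", str(device)]
    args += [
        "-k", "groestlcoin",
        "--difficulty-multiplier", "0.0039062500",  # exactly 1/256
        "-o", url, "-u", user, "-p", password,
        "--worksize", "256",  # the only worksize this kernel is said to support
        "--intensity", str(intensity),
    ]
    # --shaders is left out on purpose: it is reported as ignored for groestl.
    return args


if __name__ == "__main__":
    argv = build_sgminer_args("http://localhost:1441", "u", "p", intensity=22, device=1)
    print(shlex.join(argv))
```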

realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 08, 2015, 09:48:35 PM
 #160

I'm using your kernel: groestlcoin.cl.

Now I disassembled a dummy kernel with the appropriate parameters, and I had forgotten about the T buffers. OpenCL uploads them in an extra buffer automatically. I don't even want to know how the driver sends that extra buffer, and most importantly I can't make an automatic skeleton kernel to get a binary with a placeholder for constant data that my program can patch with the output of the assembler.

So the easiest way would be to modify sgminer to handle my kernel. I have found the 'queue_sph_kernel()' function, which I can start from.
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 08, 2015, 09:57:06 PM
 #161

I never tested my kernel with cards smaller than tahiti, I also have no reports of it running on <= pitcairn: other groestlcoin kernels might be faster in that case.

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 09, 2015, 03:55:25 PM
 #162

I see there is very little interest in mining groestl coins with GPU: very few users joined the recent discussion (2/3).
Let alone contributing to the code (2) or donating (2), in the whole life of this thread.

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
January 09, 2015, 04:08:09 PM
 #163

Well, I still prefer GPU mining while the block reward is 1.0, and I will see what happens to the diff when the reward drops to 0.1 ... So count me in on the new kernel. I donated a bit last time you did a new kernel and will donate again for a new super-super asm kernel :)
I expect the diff will drop remarkably when the reward drops, and solo mining might still be attractive even after that ...
I have one 280x solo mining DMD (Pallas Diamond) at approx. 18.6 MH/s (2-4 coins per day)
and a 7950 solo mining FTC (neoscrypt) at 278 KH/s (it would be sweet if these optimizations could be applied to Neoscrypt as well ... wolf0, where are you?)

@realhet
It would be great if you could add a kernel setting parameter (perhaps realhet) that selects your kernel, and supply a Windows x64 build of your sgminer ... I'd donate for that :)
realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 10, 2015, 01:41:26 AM
 #164

Well, I found it better not to alter sgminer, since I'm totally unfamiliar with it, and instead started to make my kernel look exactly the same as groestlcoin.cl from the outside. It will take half a page of additional code that deals with the kernel parameters. It is already working with a small dummy kernel, but I'm just too tired to continue now. :D
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 10, 2015, 03:48:37 AM
 #165

Well, that's easier for people to use.
Looking forward to seeing your progress! :-)

realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 11, 2015, 03:01:34 AM
 #166

Do I need better proof than this? ;D

I'm the proud owner of my first 19 GRS coins, haha. I guess I was super lucky to get an 'accepted' after just 10 minutes of mining.

The speed increase in sgminer is the same as what I measured in my 'workbench': it rose from 2 MH/s to 7 MH/s. (Or, if we count in Groestl hashes per second, from 4 MH/s to 14 MH/s.)
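
The factor of two between the two ways of counting can be checked against the workbench figures posted earlier in the thread (256*10*512 work items; 558.613 ms and 279.916 ms). This is a sketch that assumes each work item performs two chained Groestl hashes and that the workbench counts both; that assumption is inferred from the numbers rather than stated outright, and the function name is hypothetical.

```python
# Sketch: convert a workbench timing into MH/s under two counting conventions.
# Assumption (inferred, not confirmed): one work item = 2 chained Groestl hashes.
WORK_ITEMS = 256 * 10 * 512  # total work items per run, as posted earlier


def hashrate_mhs(work_items: int, elapsed_ms: float, hashes_per_item: int = 1) -> float:
    """Return the rate in MH/s for a run of work_items taking elapsed_ms."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return work_items * hashes_per_item / (elapsed_ms / 1000.0) / 1e6


if __name__ == "__main__":
    for elapsed_ms in (558.613, 279.916):
        per_hash = hashrate_mhs(WORK_ITEMS, elapsed_ms, hashes_per_item=2)
        per_nonce = hashrate_mhs(WORK_ITEMS, elapsed_ms, hashes_per_item=1)
        print(f"{elapsed_ms} ms: {per_hash:.3f} MH/s counting single hashes, "
              f"{per_nonce:.3f} MH/s counting full double-hash computations")
```

Counting single hashes reproduces the 4.693 and 9.365 MH/s workbench figures; counting full computations gives half of that, which is the convention a miner reporting 2 MH/s on the same card would be using.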

If anyone is willing to help me test this, please tell me! You'll need Windows with Catalyst 14.9, and you also have to be brave enough to run my IDE (HetPas.exe) on that system.

I can't wait to see your reports on how fast it is on the big cards. :D
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 11, 2015, 03:24:35 AM
 #167

The compiled bin file should work regardless of catalyst version or operating system, so could you please post a link to the bin file? Thanks.
There is something weird about the sgminer screenshot; are you sure it's working correctly? It shows a single, disabled GPU with id 0, and the share that got accepted was from GPU id 1. The diff numbers are also kinda weird.

Wolf0
Legendary
*
Online Online

Activity: 1666


Miner Developer


View Profile
January 11, 2015, 05:23:58 AM
 #168

SG bug.

qwep1
Hero Member
*****
Online Online

Activity: 490


View Profile
January 11, 2015, 09:21:06 AM
 #169

I would also like to test it.
realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 11, 2015, 10:54:26 PM
 #170

Sorry it took a bit long.

Here's all you need to know if you're willing to test: http://realhet.wordpress.com/gcn-asm-groestl-coin-kernel/

Please send me benchmarks and compiled kernels for various cards!

I've been running it for an hour now and I got a 'rejected'. I'm solo mining GRS. Do I need to worry? Or is that normal? Can it be caused by a slow network?
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 11, 2015, 10:56:59 PM
 #171

Realhet, thanks for the capeverde bin, unfortunately I can't use it because it's 32 bit.
I created a bootable win7 stick in order to compile the kernel: it compiles fine but, when run, it says "no target Hawaii" and no bin is created.

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 11, 2015, 10:58:04 PM
 #172

I'm running it for an hour now and I got a 'rejected'. I'm solo mining GRS. Do I need to worry? Or is it usual? Can it be caused by slow network?

Yes, it can be because of the network: if the wallet is behind on sync, the block may be rejected (or orphaned).
Try with a pool...

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
January 11, 2015, 11:11:06 PM
 #173

Runtime error: No GCN device found

I have 2 AMD cards on gpu-platform 1
and 1 Intel GPU on gpu-platform 0

Edit: DOH 14.7RC3 not GCN ...
realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 11, 2015, 11:47:28 PM
 #174

Thx for testing! So many errors :S But usually that's how it goes.

"No GCN device found" error.

That could be because I can't recognize new cards.
I know only these at the moment.
'TAHITI', 'PITCAIRN', 'CAPEVERDE', 'UNKNOWN5');
Importing new names right now.

Meanwhile you can select an OpenCL device by uncommenting this line in the code:
var dev:=cl.devices[0]; //access device by index (must be a GCN one)

The findDevices function can't recognize new cards. I'll repair it now.

@pallas: Thanks for fiddling with Win7! :D What do you mean by 32-bit code? That has no meaning for the GCN hardware o.O
But I'm 100% sure that you can't use my Capeverde binary unless you have that chip in the device you selected. ( var dev:=cl.devices[CLdeviceIndex]; )
   
realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 12, 2015, 12:19:53 AM
 #175

I've updated HetPas and the groestl_isa.hpas too. Pls download HetPas150111_Groestl.zip.

From now it will start with a list of the cards:
writeln("List of opencl devices:");
for var i:=0 to cl.devices.count-1 do begin
  writeln("Device #",i);
  writeln(cl.devices[ i].dump);
end;

It should display something like this:
List of opencl devices:
Device #0
Target: Cayman  Series: 6  Core:880 MHz  CU:24  RAM:2048 MB  UID:4098
ext: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics ...
Device #1
Target: Capeverde  Series: 7  Core:880 MHz  CU:10  RAM:1024 MB  UID:4098
ext: cl_khr_fp64 cl_amd_fp64 ...

Using device:
Target: Capeverde  Series: 7  Core:880 MHz  CU:10  RAM:1024 MB  UID:4098
ext: cl_khr_fp64 cl_amd_fp64 ...
* core MHz value is not always accurate, use Catalyst Control Center (or ADL) instead!

For GCN cards, the 'Series' must be at least 7. If it fails and the card is indeed a GCN card, then I detected it incorrectly; please report it. My first card is a series 6xxx Northern Islands card; it can't be used for this kernel.
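
For anyone collecting these listings from testers, the "Series must be at least 7" rule can be applied automatically to the pasted output. The sketch below is not part of HetPas; it only parses text in the format shown above, and the function name and regular expressions are assumptions made for illustration.

```python
import re

# Patterns for the listing format shown above (assumed to be stable).
DEVICE_RE = re.compile(r"^Device #(\d+)\s*$")
TARGET_RE = re.compile(r"^Target:\s*(\S+)\s+Series:\s*(\d+)\b")


def gcn_device_indices(dump: str, min_series: int = 7):
    """Return (index, target) pairs for listed devices whose Series >= min_series."""
    result = []
    current = None  # index from the most recent "Device #n" line
    for raw in dump.splitlines():
        line = raw.strip()
        m = DEVICE_RE.match(line)
        if m:
            current = int(m.group(1))
            continue
        m = TARGET_RE.match(line)
        # Only accept a Target line that directly belongs to a "Device #n" entry,
        # so the trailing "Using device:" summary is not counted twice.
        if m and current is not None:
            if int(m.group(2)) >= min_series:
                result.append((current, m.group(1)))
            current = None
    return result


SAMPLE = """List of opencl devices:
Device #0
Target: Cayman  Series: 6  Core:880 MHz  CU:24  RAM:2048 MB  UID:4098
Device #1
Target: Capeverde  Series: 7  Core:880 MHz  CU:10  RAM:1024 MB  UID:4098

Using device:
Target: Capeverde  Series: 7  Core:880 MHz  CU:10  RAM:1024 MB  UID:4098
"""

if __name__ == "__main__":
    print(gcn_device_indices(SAMPLE))
```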

@utahjohn: Maybe it works on 14.7 too. I can't tell that, but I know that it will crash on 13.4 because the kernel parameters are handled differently in that driver.
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
January 12, 2015, 12:22:34 AM
 #176

Temporarily upgraded to 14.9 to run hetpas, built for 280x.
Had hell of a time reverting back to 14.7 ... several tries later 14.7 working again and I have a kernel.elf for 280x.

Testing now ...

Very early results ...
280x I=22 E=1180 M=150 WS=256 ... 26 MHs Solo . No blocks yet ... approx 1.4x normal diamond kernel (18.5MHs)

Intensity 22 is sweet spot for my 280x, now playing with mem clock ...

No significant effect on raising mem-clock other than higher temps ...

stick with low mem clock.
realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 12, 2015, 12:56:32 AM
 #177

"Very early results ..."

Very good that it runs for you!

The speedup is not that impressive, but let me ask you to do a test:

After you stop sgminer, please run groestl_isa.hpas and copy/paste my program's output here, like this:

-----------------------------------------
Using new GCN ASM code
Kernel binary saved: C:\Work\Groestl\kernel_dump\kernel.elf

elapsed: 190.661 ms  13.749 MH/s   gain:   3.44x
elapsed: 188.444 ms  13.911 MH/s   gain:   3.48x
elapsed: 188.218 ms  13.928 MH/s   gain:   3.48x
elapsed: 188.225 ms  13.927 MH/s   gain:   3.48x

Functional test: RESULT IS OK
-----------------------------------------

And then go to around line 23 and comment out the "#define USE_NEW_ASM_KERNEL" and run it again! This will compile the original OpenCL kernel I've downloaded with sgminer5.1.

-----------------------------------------
Using original OpenCL code
Kernel binary saved: C:\Work\Groestl\kernel_dump\kernel.elf

elapsed: 657.623 ms  3.986 MH/s   gain:   1.00x
elapsed: 655.396 ms  4.000 MH/s   gain:   1.00x
elapsed: 654.897 ms  4.003 MH/s   gain:   1.00x
elapsed: 655.055 ms  4.002 MH/s   gain:   1.00x

Functional test: RESULT IS OK
-----------------------------------------

As you can see, on my small card the speedup is 3.5x. I'd like to check these results on your 280x as well.
I think the problem is simply that your big card doesn't get enough threads, or something similar.

Just a silly test: what if you turn Memory clock up to normal speed? Maybe it will change the L1 cache's behaviour? My kernel uses 0 memory, but uses L1 cache extensively.

And finally I had an 'accepted', phew...

"Had hell of a time reverting back to 14.7" -> Is there a tool called "Catalyst Clean Uninstall Utility" nowadays? 2-3 years ago that was useful when decrease Cat version.
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
January 12, 2015, 01:09:06 AM
 #178
No significant effect on raising mem-clock other than higher temps ...

Use "DDU" to clean out the Catalyst drivers. It is not always 100% effective; sometimes a little manual cleaning is needed too ...

BTW I am using Pallas kernel as reference, not one supplied with stock sgminer ...

Any tweaks you can do with 2048 shaders (280x) and 1792 shaders (7950) ?
realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 12, 2015, 01:24:21 AM
 #179

Yes, that must be the same kernel that I copied into the groestl directory next to the groestl_isa.hpas file.

When you compile the original kernel within the groestl_isa.hpas program, it will use the groestl_original.cl kernel. It's Pallas's kernel, except that I hardcoded the workgroup size in it and made another very minor change.

I also compared it with the kernel I downloaded from the very first post in this topic: it's the same.
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
January 12, 2015, 01:29:45 AM
 #180

I did not try running kernel under catalyst 14.9, all I wanted was to generate the kernel.elf to run under 14.7 ... because I run multiple algos concurrently under 14.7 that suffer under 14.9 ...

Also note that I am running sgminer 4.1.0
realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 12, 2015, 01:36:18 AM
 #181

I tested my kernel only in Cat 14.9
I have no info on how it works on 14.7

When you compile in HetPas it will generate a skeleton kernel binary with the help of the OpenCL compiler, and then the new assembly code will be PATCHED into that. So I don't make the binary from scratch, and maybe the 14.7 binary is a bit different from the 14.9 binary in a way I just don't know about. (Life would be so much easier if AMD were kind enough to give us an interface to upload binary program code... But that's not going to happen :D)


"Any tweaks you can do with..."

Please let's do the test inside the IDE first. Let's compare the original and the new kernel there, as it is perfect for timing. In sgminer we need to play with Intensity and other factors and wait for minutes to get a correct time anyways.

So please paste here what you see in the right pane of HetPas after you run the program.
I'm interested in the following information; also tell me which card and engine MHz you used:

Using new GCN ASM code
Kernel binary saved: C:\Work\Groestl\kernel_dump\kernel.elf

elapsed: 190.645 ms  13.750 MH/s   gain:   3.44x
elapsed: 188.281 ms  13.923 MH/s   gain:   3.48x
elapsed: 188.233 ms  13.927 MH/s   gain:   3.48x
elapsed: 188.316 ms  13.920 MH/s   gain:   3.48x

Functional test: RESULT IS OK

realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 12, 2015, 02:04:01 AM
 #182

Thanks!

Well this is kinda bad for a Tahiti :/

Also the times of the 4 kernel launches are weird:
On my card the gains are 3.44x, 3.48x, 3.48x, 3.48x
But on your card they are 3.88x, 3.10x, 3.10x, 3.10x

On my card the first launch is a bit slow because the card was at low MHz when the test started, and after the warmup it settled at a steady 3.48x.

On your card the speeds are so random. Your card (at 1150) is 3.68x faster than mine, so if everything were ok, you should have seen 12.8x gains.
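
The 12.8x expectation is just the product of two ratios. Here is a minimal Python sketch of that arithmetic, assuming the 3.68x card ratio comes from compute units times engine clock. The 32 CU / 1150 MHz figures are the Tahiti values reported in this thread; the 10 CU / 1000 MHz figures for the reference card are an assumption, and the helper name is made up for illustration:

```python
# Sketch of the expected-gain arithmetic (hypothetical helper name).
# Assumption: throughput scales with compute units x engine clock.

def card_ratio(cu_a, mhz_a, cu_b, mhz_b):
    """How many times faster card A should be than card B."""
    return (cu_a * mhz_a) / (cu_b * mhz_b)

# Tahiti at 1150 MHz with 32 CUs vs. an ASSUMED 10 CU / 1000 MHz reference card.
ratio = card_ratio(32, 1150, 10, 1000)

asm_gain_on_reference = 3.48  # steady-state gain quoted in the post
expected_gain = asm_gain_on_reference * ratio

print(f"card ratio:    {ratio:.2f}x")
print(f"expected gain: {expected_gain:.1f}x")
```

If the measured gain is far below this product, something other than the kernel is probably limiting the run (driver, clocks, or another process using the GPU).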

Maybe it is a 14.7 issue, I don't know. Everything can change from driver to driver...

What is on my mind is:

1. What if you change WorkCount from the original
    WorkCount := 256*10*512
to WorkCount := 256*10*512*10;  ?
Do the elapsed times become 10x longer?  (The functional test will fail, that's ok; just reset WorkCount to the default value after this test.)

2. Let's see how the original kernel works in HetPas:
  just comment out the  "#define USE_NEW_ASM_KERNEL" and let me see the times please. If the original kernel works well, then the gain must be 3.68x.


(Thank you for testing so far)

--------------------------------------------------------------------
"elapsed: 50.686 ms  51.719 MH/s   gain:  12.93x"
WOW! THIS IS IT! :D:D:D
Exactly what I expected! Your card is 3.71x faster. What was the error? You accidentally mined while testing, right?
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
January 12, 2015, 02:15:04 AM
 #183

The last test run I did grabbed 2 cards so divide in half for an average on Tahiti (280x+7950).

Not the gains I was expecting based on your blog ... 3.4x times 18.5 MH/s should net me around 62 MH/s vs the 26 MH/s I'm getting now ... so not such great gains on Tahiti, but better :)

Short of physically pulling a card I don't know how to stop hetpas from running on all of them ...
Star65
Member
**
Offline Offline

Activity: 65


View Profile
January 12, 2015, 02:34:42 AM
 #184

I would also like to test on 7970 & 280x.
realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 12, 2015, 02:47:15 AM
 #185

There must be some misunderstandings about the MH/s values, so we have to be careful!

In this topic (first post), when Pallas says that the R9 280x does 18MH/s, he counts it in Groestl hashes.

When my program says "elapsed: 50.686 ms  51.719 MH/s" it also counts Groestl hashes, just as Pallas does.

But when you see MH/s inside sgminer it must be multiplied by 2, because in SG 1 MH/s = 2 MGroestlH/s.

--------------------------
So when you see "51.719 MH/s" in my program
then you must see 26MH/s in SG.

And when you see 18MH/s in the first post of this topic
you must see 9MH/s in SG.

Also when I see 4MH/s in my program
then I saw 2MH/s in SG.
---------------------------

So the equation is: 2*sgminer Mh/s = Pallas's Mh/s

This is because sgminer counts 2 Groestl hash calculations as 1. But Pallas counts them as 2 hashes, and I just copied Pallas, then later found out how sgminer calculates.
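
The relationship described here can be checked against the numbers the benchmark prints. This is a rough Python sketch, assuming the default WorkCount of 256*10*512 work items and two Groestl hashes counted per work item. Both assumptions are taken from posts in this thread; the function names are made up for illustration:

```python
# Sketch: reproduce the benchmark's MH/s figure from WorkCount and elapsed time.
# Assumptions: default WorkCount = 256*10*512, and the benchmark counts
# 2 Groestl hashes per work item, as described in the post.

WORK_COUNT = 256 * 10 * 512  # work items per kernel launch

def groestl_mhs(elapsed_ms, work_count=WORK_COUNT, hashes_per_item=2):
    """Million Groestl hashes per second for one kernel launch."""
    seconds = elapsed_ms / 1000.0
    return work_count * hashes_per_item / seconds / 1e6

def sgminer_equivalent(mhs_groestl):
    """Per the post: sgminer counts the two Groestl passes as one hash."""
    return mhs_groestl / 2

for elapsed in (188.225, 50.686):
    mhs = groestl_mhs(elapsed)
    print(f"{elapsed:8.3f} ms -> {mhs:.3f} MH/s (Groestl), "
          f"{sgminer_equivalent(mhs):.2f} MH/s (sgminer scale)")
```

Both elapsed times reproduce the MH/s values printed by the benchmark (13.927 and 51.719), so the factor of 2 does apply to the benchmark's own output. Whether the first-post numbers need the same halving is a separate question.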

---------------------------
So the Tahiti 26MH/s in sgminer is correct. Please remove the kernel and let sgminer compile it from OpenCL! If I'm calculating correctly, you should see 7-8MH/s with the original kernel. Can you check it please?


utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
January 12, 2015, 02:52:13 AM
 #186

When I run Pallas OCL I see 18.5MHs in sgminer.
When I run Realhet asm I see 26.0MHs in sgminer.
realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 12, 2015, 03:11:14 AM
 #187

When I run Pallas OCL I see 18.5MHs in sgminer.
When I run Realhet asm I see 26.0MHs in sgminer.

Please send me that .cl file and the binary that is compiled by the sgminer, I gotta check it.

For today, Thank You for testing, I gotta sleep now, see you!
Star65
Member
**
Offline Offline

Activity: 65


View Profile
January 12, 2015, 04:33:54 AM
 #188

TVM Pallas and realhet for nice work!

7970/280x 1130/300 W7

Pallas kernel in Cat 14.6  - 17.8MH/s
Pallas kernel in Cat 14.9  - 7.8MH/s   - so 14.9 very bad drivers?!
Realhet kernel in Cat 14.9 - 24.8MH/s - 24.8/7.8=3.18x !!!
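
How big the ASM gain looks depends on which baseline it is divided by. A small Python sketch using only the three hashrates listed in this post (the dictionary keys are made-up labels):

```python
# Sketch: the same ASM result compared against two different baselines.
# Hashrates (MH/s) are the ones reported in this post for a 7970/280x.
rates = {
    "pallas_cat_14.6": 17.8,
    "pallas_cat_14.9": 7.8,
    "realhet_cat_14.9": 24.8,
}

def speedup(new, baseline):
    """Ratio of the new hashrate to the baseline hashrate."""
    return new / baseline

vs_14_9 = speedup(rates["realhet_cat_14.9"], rates["pallas_cat_14.9"])
vs_14_6 = speedup(rates["realhet_cat_14.9"], rates["pallas_cat_14.6"])

print(f"vs Pallas OCL built on 14.9: {vs_14_9:.2f}x")
print(f"vs Pallas OCL built on 14.6: {vs_14_6:.2f}x")
```

Against the 14.9 build the gain is about 3.18x; against the 14.6 build it is about 1.39x.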

We need realhet kernel (bin) with Cat 14.6 or 14.7 (best drivers perhaps). But I do not know how to do it.

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
January 12, 2015, 05:03:14 AM
 #189

14.9 has a piss poor OCL compiler, we've known this for a long time ... Stick with 14.7RC3 for the best overall performance across many different algos.

I guess we are stuck with compiling realhet asm on 14.9 but 14.7 does better compiles for OCL.

I am running realhet asm kernel generated with 14.9 on 14.7 catalyst, just a pain in the ass reverting to 14.7 after using 14.9.

My Pallas OCL compile was done with 14.7RC3 and works better than OCL compiled on 14.9.
Pallas ocl compiled with 14.7RC3 will run normal on 14.9, just don't re-compile it with 14.9 ...

Confused yet? hehe

@Realhet
So the gain of Realhet = 1.40x Pallas stands when comparing to properly working Pallas OCL kernel on 14.7
(Same clocks and Intensity running under 14.7 so a fair compare).
Your Pallas reference speed in hetpas is incorrect because the 14.9 compiler mangled the OCL badly, performance-wise.
Take a look at the performance hit on 14.9 in Star65's post above.
Unfortunately some of the "gains" you made may have just been repairing 14.9 OCL bugs LOL, but obviously an improvement was made somewhere in the asm kernel.
You need to establish a baseline for your GPU using 14.7 Pallas OCL and see what really made the improvements ...
I suggest starting over and using this first round as a learning experience :)  You started with code broken by the 14.9 compiler as a base ...

Pallas 14.7 OCL Bin for 280x 18.5 MHs
https://mega.co.nz/#!kAEnDATC!HeelwXTHDsQNx8WJhTDcwqS-slOmikoBiMqTEK9-DV0
Realhet 14.9 ASM bin for 280x 26.0 MHs
https://mega.co.nz/#!1NlRhYLC!7oLFfr2umL7T2Lc0fX3HY1ddthbpNqt6I_tYdG9OI9g

Another random thought :) Can you set hetpas up to "cross-compile" for different GCN architectures, so all we have to do is download bin files from you to test them?  I really dislike the uninstall-install-uninstall-install cycle to try a new asm version on 14.7 ... For example, have it compile Tahiti.elf, hawaii.elf etc.  I understand you can only test on your own card, but having us out here test the other elf files would speed up the testing of new versions ...

DMD Donations : dJrhv4Pp1FXPrQiEp5njx42QrZiuZrbjQ1

Block found and accepted while solo mining, so your asm kernel appears to be valid :)

I'd like you to have a look see what you can do to further improve wolf0's neoscrypt kernel with asm when you get time.
7950 currently doing 278KHs mining FTC.  PM me for OCL and BIN.
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 12, 2015, 09:18:29 AM
 #190

"when Pallas says that R9 280x is 18MH/s he counts it in Groestl hashes."

No, my hashrates are taken from sgminer.

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 12, 2015, 09:20:38 AM
 #191

@pallas: Thanks for fiddling with Win7! :D What does "32 bit code" mean? That has no meaning regarding the GCN hardware o.O
But I'm 100% sure that you can't use my Capeverde binary unless you have that chip in the device you selected. ( var dev:=cl.devices[CLdeviceIndex]; )

Bins generated by sgminer on a 32 bit system will not work on a 64 bit one and vice versa, so I suppose the same is true for your kernels.

sp_
Legendary
*
Offline Offline

Activity: 1134

Ccminer developer


View Profile
January 12, 2015, 09:25:04 AM
 #192

@pallas: Thanks for fiddling with Win7! Cheesy What does it means by 32 bit code? That has no meaning regarding the GCN hardware o.O
But I'm 100% sure that you can't use my Capeverde binary unless you have that chip in the device you selected. ( var dev:=cl.devices[CLdeviceIndex]; )
Bins generated by sgminer on a 32 bit system will not work on a 64 bit one and viceversa, so I suppose the same is true for your kernels.

On linux yes, but on windows they work. You need to run the x86 build of sgminer.

BTC: 1CTiNJyoUmbdMRACtteRWXhGqtSETYd6Vd
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 12, 2015, 09:26:25 AM
 #193

@pallas: Thanks for fiddling with Win7! Cheesy What does it means by 32 bit code? That has no meaning regarding the GCN hardware o.O
But I'm 100% sure that you can't use my Capeverde binary unless you have that chip in the device you selected. ( var dev:=cl.devices[CLdeviceIndex]; )

Bins generated by sgminer on a 32 bit system will not work on a 64 bit one and viceversa, so I suppose the same is true for your kernels.

In fact:

[10:25:27] Internal error: Input OpenCL binary is not for the target!

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 12, 2015, 09:30:29 AM
 #194

@pallas: Thanks for fiddling with Win7! Cheesy What does it means by 32 bit code? That has no meaning regarding the GCN hardware o.O
But I'm 100% sure that you can't use my Capeverde binary unless you have that chip in the device you selected. ( var dev:=cl.devices[CLdeviceIndex]; )

Bins generated by sgminer on a 32 bit system will not work on a 64 bit one and viceversa, so I suppose the same is true for your kernels.
Mine end in l4.bin ... am I 32 or 64 ... (win 7 x64)

4 * 8 (bits) = 32

it's the size of a long integer.
probably the sgminer build you are using is 32 bit.

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 12, 2015, 10:04:38 AM
 #195

@pallas: Thanks for fiddling with Win7! Cheesy What does it means by 32 bit code? That has no meaning regarding the GCN hardware o.O
But I'm 100% sure that you can't use my Capeverde binary unless you have that chip in the device you selected. ( var dev:=cl.devices[CLdeviceIndex]; )

Bins generated by sgminer on a 32 bit system will not work on a 64 bit one and viceversa, so I suppose the same is true for your kernels.
Mine end in l4.bin ... am I 32 or 64 ... (win 7 x64)

4 * 8 (bits) = 32

it's the size of a long integer.
probably the sgminer build you are using is 32 bit.
question is does hetpas use 32 or 64 bit ... I'd assume 32 bit since it runs ok on my sgminer ...
my sgminer is old 4.1.0 ...

so your main prob is needing hetpas src to run on linux ...

Probably realhet coded it for 32 bit; I don't know what changes, maybe the parameter passing part.
I hope realhet has time to look into this.
I also use version 4.1.
Hetpas can't run on linux: I'll try again with the new version when I can access my workstation and make it boot on windows.

JuanHungLo
Hero Member
*****
Offline Offline

Activity: 739


I don't always drink...


View Profile
January 12, 2015, 12:30:42 PM
 #196

I built my bins with Wolf0's x64 miner.  Works perfectly.

I'm glad I'm not judgmental like all you smug, superficial idiots
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 12, 2015, 12:37:57 PM
 #197

I built my bins with Wolf0's x64 miner.  Works perfectly.

could you share your bin files please?

JuanHungLo
Hero Member
*****
Offline Offline

Activity: 739


I don't always drink...


View Profile
January 12, 2015, 01:37:52 PM
 #198

I built my bins with Wolf0's x64 miner.  Works perfectly.

could you share your bin files please?

Personally, I wouldn't download this.  I'd generate my own.  But here it is.  Use at your own risk!
http://ge.tt/2uga0R82/v/0?c

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 12, 2015, 01:41:34 PM
 #199

I built my bins with Wolf0's x64 miner.  Works perfectly.

could you share your bin files please?

Personally, I wouldn't download this.  I'd generate my own.  But here it is.  Use at your own risk!
http://ge.tt/2uga0R82/v/0?c

Thanks, but it's 32 bit, I need 64 bit.

pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 12, 2015, 01:51:21 PM
 #200

HOW TO TELL IF AN SGMINER BIN FILE IS 32 OR 64 BIT

If the filename, generated by sgminer, ends in l4.bin it is 32 bit (8 x 4 = 32)
If the filename, generated by sgminer, ends in l8.bin it is 64 bit (8 x 8 = 64)

They are incompatible.
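
The naming rule above can be sketched as a small Python helper. This is only an illustration of the rule as stated in this post: the function names are made up, and the sample filenames are hypothetical, not taken from a real sgminer run. If the number really is the size of a long integer, as an earlier post says, note that a long is commonly still 4 bytes on 64-bit Windows builds, so treat the result as a hint rather than proof:

```python
import re

# Sketch of the naming rule from the post: the .bin filename ends in "l<N>.bin",
# where N is a size in bytes (4 -> 32 bit, 8 -> 64 bit).
_SUFFIX = re.compile(r"l(\d+)\.bin$")

def bin_bits(filename):
    """Return 32 for ...l4.bin, 64 for ...l8.bin, or None if no suffix is found."""
    match = _SUFFIX.search(filename)
    if match is None:
        return None
    return int(match.group(1)) * 8

def same_bitness(bin_a, bin_b):
    """True only when both names carry a suffix and the suffixes agree."""
    bits_a, bits_b = bin_bits(bin_a), bin_bits(bin_b)
    return bits_a is not None and bits_a == bits_b

# Hypothetical filenames, for illustration only.
print(bin_bits("diamondTahitigw256l4.bin"))
print(bin_bits("diamondTahitigw256l8.bin"))
print(same_bitness("diamondTahitigw256l4.bin", "diamondTahitigw256l8.bin"))
```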

Star65
Member
**
Offline Offline

Activity: 65


View Profile
January 12, 2015, 02:50:10 PM
 #201

Guys! We do not need more optimization! If we all get a faster kernel, then the difficulty will increase proportionally. Accordingly, we will not get more coins, but will pay more for electricity. Profits will only decrease.  :(
A faster kernel is good only for the devs (as a reward for their hard work), I think.
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 12, 2015, 02:53:17 PM
 #202

Guys! We do not need more optimization! If all we get a faster kernel, then the difficulty will increase proportionally. Accordingly, we will not get more coins, but will pay more for electricity. Profits will only decrease.  Sad
Faster kernel good for dev only (as a reward for their hard work), i think so.

true....
until half the hashpower is held by a couple of fpga miners (or so they say) ;-)

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
January 12, 2015, 02:55:41 PM
 #203

Guys! We do not need more optimization! If all we get a faster kernel, then the difficulty will increase proportionally. Accordingly, we will not get more coins, but will pay more for electricity. Profits will only decrease.  Sad
Faster kernel good for dev only (as a reward for their hard work), i think so.
Not everyone will use the new kernel, so there is an advantage.  Yes, diff will go up some.  Also, as diff goes up many miners will drop like flies, so it will even out ...
Tell all your friends to Cloudmine/Multipool mine and stop direct mining; this will lower diff for diehard solo miners :)

3 DMD blocks since I started the ASM kernel last night ... :)
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 12, 2015, 03:50:07 PM
 #204

@Pallas
It is extremely rare for me to see any orphan when solo mining so I would venture to guess your network is too slow.

probably too few nodes nearby: I have a 20-30 msec round trip time to the big internet nodes in my country.
having few fast nodes nearby means my blocks take a lot of time to spread through the diamond network.
or a lot of bad luck :D

qwep1
Hero Member
*****
Online Online

Activity: 490


View Profile
January 12, 2015, 04:46:17 PM
 #205

Target: Tahiti  Series: 7  Core:1100 MHz  CU:32  RAM:3072 MB  UID:4098
ext: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event
* core MHz value is not always accurate, use Catalyst Control Center (ADL) instead!

elapsed: 69.778 ms  37.568 MH/s   gain:   9.39x
elapsed: 54.247 ms  48.324 MH/s   gain:  12.08x
elapsed: 54.269 ms  48.305 MH/s   gain:  12.08x
elapsed: 54.236 ms  48.334 MH/s   gain:  12.08x
############### RESULT IS WRONG ###################
   idx        hi       lo           hi           lo
     0: 00000000 00000000            0            0
     1: 00000000 00000000            0            0
     2: 00000000 00000000            0            0
     3: 00000000 00000000            0            0
     4: 00000000 00000000            0            0
     5: 00000000 00000000            0            0
     6: 00000000 00000000            0            0
     7: 00000000 00000000            0            0
     8: 00000000 00000000            0            0
     9: 00000000 00000000            0            0
     A: 00000000 00000000            0            0
     B: 00000000 00000000            0            0
     C: 00000000 00000000            0            0
     D: 00000000 00000000            0            0
     E: 00000000 00000000            0            0
     F: 00000000 00000000            0            0
    10: A9A41A9D 9337706F  -1448863075  -1825083281
    11: 370D1AF4 DD743586    923605748   -579586682
    12: CB7EB389 EADF9917   -880888951   -354445033
    13: 25FA6A42 76EDCD1E    637168194   1995296030
    14: 91783455 C7EE8F10  -1854393259   -940667120
    15: F60C362A FD9AFAB3   -166971862    -40174925
    16: 038C0C0F D2E4564F     59509775   -756787633
    17: EA28DD29 3A1B41CA   -366420695    974864842
    18: 708C1E9A DFCDC04F   1888231066   -540164017
    19: 00000000 A7B76679            0  -1481152903
    1A: 00000000 00000000            0            0
    1B: 00000000 00000000            0            0
    1C: 00000000 00000000            0            0
    1D: 00000000 00000000            0            0
    1E: 00000000 00000000            0            0
    1F: 00000000 00000000            0            0
Is this normal, or am I doing something wrong?

I cannot get it to compile a file.
physixz
Newbie
*
Offline Offline

Activity: 13


View Profile
January 12, 2015, 06:59:14 PM
 #206

What's the best driver version to use? I can only get 11MH/s from my R9 290.
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 12, 2015, 07:00:26 PM
 #207

Whats the best driver version to use as i can only get 11MH/s from my R9 290

14.6b or 14.7

Or use the precompiled binary.

qwep1
Hero Member
*****
Online Online

Activity: 490


View Profile
January 12, 2015, 07:30:14 PM
 #208

Where is the kernel_dump\ folder? I cannot find it.
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
January 12, 2015, 08:12:03 PM
 #209

@realhet
OK a few things I have discovered:
1. Hetpas does compile and run ok on 14.7RC3.
    So no need to install 14.9 :)
2. Test Runs:
Target: Tahiti  core:1150 MHz  cu:32  ram:3072 MB  uid:4098
ext: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event
* core MHz value is not always accurate, use Catalyst Control Center (or ADL) instead!

Using original OpenCL code
Kernel binary saved: C:\Miners\HetPas150111_Groestl\groestl\kernel_dump\kernel.elf

elapsed: 72.626 ms  36.095 MH/s   gain:   9.02x
elapsed: 70.712 ms  37.072 MH/s   gain:   9.27x
elapsed: 70.718 ms  37.069 MH/s   gain:   9.27x
elapsed: 70.741 ms  37.057 MH/s   gain:   9.26x

Functional test: RESULT IS OK

Target: Tahiti  core:1150 MHz  cu:32  ram:3072 MB  uid:4098
ext: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event
* core MHz value is not always accurate, use Catalyst Control Center (or ADL) instead!

Using new GCN ASM code
Kernel binary saved: C:\Miners\HetPas150111_Groestl\groestl\kernel_dump\kernel.elf

elapsed: 53.629 ms  48.881 MH/s   gain:  12.22x
elapsed: 50.666 ms  51.740 MH/s   gain:  12.93x
elapsed: 50.677 ms  51.729 MH/s   gain:  12.93x
elapsed: 50.660 ms  51.746 MH/s   gain:  12.94x

Functional test: RESULT IS OK

3. Calculated speed gain is close to the actual speed gain of 1.40x as shown running sgminer :)

4. The first run of OCL should be the reference value of 1.0x for a proper comparison; this needs to be reset in hetpas for each architecture.

5. Your timing calculations appear to be wrong.  Single 280x OCL is 18.5MHs, single 280x ASM is 26.0MHs.
    Are you sure hetpas is not using BOTH of the cards in my test box when running tests?  I am mining in sgminer with a SINGLE card; the other is turned off and used in another instance of sgminer mining neoscrypt ...
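
The comparison in point 3 can be reproduced from the elapsed times in the two runs above. A minimal Python sketch, assuming the first launch of each run is warm-up and should be skipped (the helper name is made up for illustration):

```python
from statistics import mean

# Elapsed times (ms) copied from the two benchmark runs above (Tahiti, 1150 MHz).
ocl_ms = [72.626, 70.712, 70.718, 70.741]
asm_ms = [53.629, 50.666, 50.677, 50.660]

def steady_state(times_ms, warmup_runs=1):
    """Mean elapsed time, skipping the first (warm-up) launch."""
    return mean(times_ms[warmup_runs:])

benchmark_gain = steady_state(ocl_ms) / steady_state(asm_ms)
sgminer_gain = 26.0 / 18.5  # hashrates reported from sgminer in this post

print(f"benchmark gain: {benchmark_gain:.3f}x")
print(f"sgminer gain:   {sgminer_gain:.3f}x")
```

Both ratios land close to 1.40x. As a side note, halving the benchmark's roughly 37.07 and 51.74 MH/s gives about 18.5 and 25.9 MH/s, close to the sgminer readings. That may point to the benchmark counting two Groestl hashes per work item rather than to it running on both cards.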
pallas
Legendary
*
Offline Offline

Activity: 1386


Black Belt Developer


View Profile
January 12, 2015, 09:35:34 PM
 #210

The new version compiles fine, but of the two GPUs only id 1 works, id 0 doesn't produce any valid work unit.
Speed: r9 290 30Mh/s, r9 290x 33Mh/s (1100 MHz)
My experimental opencl kernel is a couple percent faster.

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
January 12, 2015, 09:50:34 PM
 #211

The new version compiles fine, but of the two GPUs only id 1 works, id 0 doesn't produce any valid work unit.
Speed: r9 290 30Mh/s, r9 290x 33Mh/s (1100 MHz)
My experimental opencl kernel is a couple percent faster.
care to share the newest incarnation of OCL? PM me a link, for personal use only :) :)
physixz
Newbie
*
Offline Offline

Activity: 13


View Profile
January 12, 2015, 11:43:48 PM
 #212

When I run HetPas it doesn't detect the graphics cards even though I'm running 14.9 drivers. Anybody know why?

I get either "Runtime error: openCL error: CL_Device_not_found", or "no GCN device found" when I re-enable the Intel integrated graphics. I am running 3 R9 290's and I've tried 14.9 and 14.12 beta drivers and neither works.
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
January 13, 2015, 12:46:01 AM
 #213

when i run HetPas it doesn't detect the graphics cards even though im running 14.9 drivers. anybody know why?

i get either Runtime error: openCL error: CL_Device_not_found or no GCN device found when i re-enable the intel integrated graphics. i am running 3 R9 290's and ive tried 14.9 and 14.12 beta drivers and neither work
I had to disable the Intel onboard graphics, uninstall all drivers and reinstall 14.7RC3.  What is happening is that your AMD cards end up on the wrong gpu-platform: 1 in my case, while Intel was gpu-platform 0.
Hetpas appears to look only at gpu-platform 0.
Completely uninstall all display drivers with DDU, then go to the BIOS and disable the onboard Intel graphics.  When the AMD cards are redetected they will appear on gpu-platform 0.

AVOID 14.9 like the plague, its OCL compiler generates very slow code.

physixz
Newbie
*
Offline Offline

Activity: 13


View Profile
January 13, 2015, 01:27:09 AM
 #214

Thanks, I wasn't disabling the Intel graphics in UEFI, which was my problem. It's working now at 26.5MH/s per card, which is amazing.
utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
January 13, 2015, 01:35:36 AM
 #215

Thanks, I wasnt disabling the intel in UEFI which was my problem. its working now at 26.5MH/s per card which is amazing.
Pallas is getting 30MHs on 290 with realhet ASM kernel ... so some further tuning now, play with intensity, gpu clock, drop mem clock to lowest possible (150 on my 280x).
JuanHungLo
Hero Member
*****
Offline Offline

Activity: 739


I don't always drink...


View Profile
January 13, 2015, 01:43:17 AM
 #216

Thanks, I wasnt disabling the intel in UEFI which was my problem. its working now at 26.5MH/s per card which is amazing.
I'm using 14.7r3, xI 2048, 1100/150, -w 256 undervolted to 1.00 and getting 23.38 MH/s.  What's your config?

utahjohn
Hero Member
*****
Offline Offline

Activity: 630


View Profile
January 13, 2015, 01:43:20 AM
 #217

Note the Intel GPU can be used for other algos such as X11 and neoscrypt.  Now that AMD is on gpu-platform 0, you can try re-enabling Intel and see if it pops up on gpu-platform 1.
Check with sgminer -n in a command prompt window.
To specify which platform to use, add --gpu-platform 0 (or 1 ...) to the sgminer command line.
Display numbering still starts from 0 on each gpu-platform.

I have heard of people also running nvidia cards in the same box as AMD, yet another gpu-platform selection ... :)
realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 13, 2015, 02:28:13 AM
 #218

Hi All,

Important things to the top:
* I slightly updated the HetPas150111_Groestl.zip -> MH/s values are now the same as in SG.
* I've updated the main page with benchmark data I've collected: http://realhet.wordpress.com/gcn-asm-groestl-coin-kernel/
* I've uploaded the diamondTahiti binary, so now there are 2 precompiled bins, thank you utahjohn!


* My MH/s misunderstanding.
Thank you all for the investigations, now I see it clearly.
When I tried groestlcoin.cl on my card on 14.9 it ran at 2 MH/s. If I scale the 25MH/s of the R9 290 down to my HD7770, then I should have got 4 MH/s.
And here comes my bad decision:
I didn't believe that the 14.6->14.9 changes were so bad that they slowed the kernel by more than 2x. Actually it was 2.6x slower than my expectations.
And because the algorithm technically contains 2 hash calculations, I thought that multiplying by 2 would give me the correct MH/s.
But as it turned out, they indeed broke 14.9 that badly.
So if I ever thought about hating ocl, now I hate it more than twice as much. To be precise, I hate it 2.6x more. :D
But on the optimistic side, because 14.9 made an exceptional quality of cr4p out of the ocl kernel, that gave me the false feeling of success that kept me optimizing, haha.
Anyways, I'm happy that it is solved now.


* HetPas and Catalyst version
When you compile an ASM kernel, my compiler generates a pure binary (and some parameters, e.g. LDS size).
In order to run it, it has to be wrapped in a complicated ELF binary image, so HetPas asks the OpenCL compiler for one.
This small skeleton kernel contains the kernel parameters that you request in the assembly source.
For this groestl kernel I supply a special skeleton.cl (see below in this post).
So once OpenCL has compiled this small skeleton kernel, my program patches the binary and other parameters into it. It also cuts out every unwanted part such as the ocl, llvmir, amd_il sections. There are even a few kilobytes of zeroes in the ELF just to be compatible with terribly old hardware; I cut that out too.
And because I use the current OpenCL system, that's why the