Bitcoin Forum
January 24, 2021, 11:07:25 PM
 Author Topic: My initial Radeon HD 7970 mining benchmarks  (Read 46675 times)
DiabloD3
Legendary

Offline

Activity: 1162
Merit: 1000

DiabloMiner author

 January 09, 2012, 08:30:21 PM

Wait wait wait. Are we sure uint16 is such a good idea? Last time I tried >4 (which was before 2.6, btw, I haven't tested with 2.6), it would crash in the compiler. Also, does anyone have a count on the number of registers per CU? There might not be enough registers to handle that.

I'm not sure if it's a good idea or not, so I wanted to measure it. GCN has 64KB worth of registers per CU, and like you said, I'm not sure if that's enough. The reason for my curiosity is that GCN's compute units each contain 4 SIMD units with a width of 16 elements (same size as Larrabee and Intel's MIC, coincidentally), and I recall reading somewhere that each of these SIMD units can retire one 16-way instruction every 4 cycles, so those 16-element vectors kind of jumped out at me. I also wanted to get familiar with the OpenCL bitcoin mining code and thought it would be a neat exercise (which it was!). Nice code, by the way.

I can say for sure that 16-element vectors DO compile with the drivers that came with the card.

The -ds code dump for 16-element vectors came out nice and clean, although the last few lines, where the result is stored in output, seem a bit branchy. It looks something like this:

Code:
if(XG2.s0 == 0x136032ED) { output[Xnonce.s0 & 0xF] = Xnonce.s0; }
if(XG2.s1 == 0x136032ED) { output[Xnonce.s1 & 0xF] = Xnonce.s1; }
if(XG2.s2 == 0x136032ED) { output[Xnonce.s2 & 0xF] = Xnonce.s2; }
...
...
if(XG2.sd == 0x136032ED) { output[Xnonce.sd & 0xF] = Xnonce.sd; }
if(XG2.se == 0x136032ED) { output[Xnonce.se & 0xF] = Xnonce.se; }
if(XG2.sf == 0x136032ED) { output[Xnonce.sf & 0xF] = Xnonce.sf; }

I tried replacing it with a branch-less expression using shuffle() and vstore16() but haven't managed to get it working. What I've come up with looks something like this:

Code:
x mask = Xnonce & 0xF;
x temp = shuffle(select(Xnonce, 0, selection), mask);
vstore16(temp, 0, output);

Anyhow I'm sure that my code modifications are doing all sorts of dumb things. I'm still learning how it all works so please ignore.

Also, check some of the larger -v settings: -v 40 is two sets of uint4 and -v 44 does three uint4s (unlike cgminer, -v 4 does two uint2s).

I've tried all of the different -v settings available (according to the source) but haven't been able to get any higher than the 666MH/s with the default settings and 3 compute threads.

The branching has ended up being the best outcome. It can evaluate those branches in parallel, and you can't easily optimize away branches for memory writes (there are apparently 2 or 3 good tricks to get rid of branch waste, it's just that none of them work on memory writes).

I should look at shuffle. Your way doesn't quite work though: vstore would output H != 0 hashes, which would trigger HW error alerts (and rightfully so) in the host code, and I consider the HW error tracking important. At least, assuming I'm reading that code right, anyways.
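A rough sketch of the kind of host-side check being described, written in Python rather than taken from the miner's host code: recompute the double SHA-256 for every nonce the kernel reports and count a HW error whenever the last 32-bit word of the digest (the H above) is not zero. The function names, the hw_errors counter and the "slot 0 means empty" rule are assumptions made up for this example; the test data is the Bitcoin genesis block header.

```python
import hashlib
import struct


def block_hash(header76, nonce):
    """Double SHA-256 of the 80-byte header (76 fixed bytes + little-endian nonce)."""
    header = header76 + struct.pack("<I", nonce)
    return hashlib.sha256(hashlib.sha256(header).digest()).digest()


def h_is_zero(digest):
    """True when the final 32-bit word of the digest is zero (the H == 0 test)."""
    return digest[28:32] == b"\x00\x00\x00\x00"


def check_output(header76, output):
    """Split reported nonces into valid shares and a HW error count.

    A slot holding 0 is treated as empty (assumption for this sketch).
    """
    shares, hw_errors = [], 0
    for nonce in output:
        if nonce == 0:
            continue
        if h_is_zero(block_hash(header76, nonce)):
            shares.append(nonce)
        else:
            hw_errors += 1
    return shares, hw_errors


# Genesis block header fields, minus the nonce.
GENESIS76 = (
    struct.pack("<I", 1)                      # version
    + bytes(32)                               # previous block hash (all zero)
    + bytes.fromhex(
        "4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b"
    )[::-1]                                   # merkle root, stored byte-reversed
    + struct.pack("<II", 1231006505, 0x1D00FFFF)  # time, bits
)
GENESIS_NONCE = 2083236893

# One good nonce, one bad nonce, one empty slot.
shares, hw_errors = check_output(GENESIS76, [GENESIS_NONCE, GENESIS_NONCE + 1, 0])
print(shares, hw_errors)
```

Any nonce that a vstore16 pushes out without H actually being zero would land in the hw_errors branch here.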

I'll go add official 8 and 16 wide support in a bit, should be useful on, say, AVX if you manually enable CPU mining in the code. SDK 2.6's cpu compiler apparently has gotten a lot better from what I've heard.

Newbie

Offline

Activity: 8
Merit: 0

 January 09, 2012, 08:32:05 PM

I'll go add official 8 and 16 wide support in a bit, should be useful on, say, AVX if you manually enable CPU mining in the code. SDK 2.6's cpu compiler apparently has gotten a lot better from what I've heard.

So does that mean SDK 2.6 is the best for 5870 cards? Or should I stick to 2.1 or 2.4? I am quite confused as to what the best SDK / ATI driver combo is ATM.
Legendary

Offline

Activity: 1428
Merit: 1000

Okey Dokey Lokey

 January 09, 2012, 08:36:27 PM

okay, i will fly to Singapore and pick one up if it all makes you happy....

i got a girl there:P
Is it Mrs. Zhou Tong?

DiabloD3
Legendary

Offline

Activity: 1162
Merit: 1000

DiabloMiner author

 January 09, 2012, 09:07:48 PM (Last edit: January 09, 2012, 09:30:35 PM by DiabloD3)

Quote from: DiabloD3
I'll go add official 8 and 16 wide support in a bit, should be useful on, say, AVX if you manually enable CPU mining in the code. SDK 2.6's cpu compiler apparently has gotten a lot better from what I've heard.

So does that mean SDK 2.6 is the best for 5870 cards? Or should I stick to 2.1 or 2.4? I am quite confused as to what the best SDK / ATI driver combo is ATM.

Notice I said CPU not GPU. CPU mining still sucks altogether. 2.1 is still best for 58xx cards.

1onevvolf
Newbie

Offline

Activity: 43
Merit: 0

 January 09, 2012, 10:59:29 PM

The branching has ended up being the best outcome. It can evaluate those branches in parallel, and you can't easily optimize away branches for memory writes (there are apparently 2 or 3 good tricks to get rid of branch waste, it's just that none of them work on memory writes).

That's interesting. Branches have always seemed to be described as anathema to well-performing kernels. I guess it all depends on how much work is being done inside. For small vector sizes there are few ifs, but with uint16 there are quite a few, so it might be worth investigating there.

I should look at shuffle. Your way doesn't quite work though: vstore would output H != 0 hashes, which would trigger HW error alerts (and rightfully so) in the host code, and I consider the HW error tracking important. At least, assuming I'm reading that code right, anyways.

Yes, I'm still getting HW alerts and haven't quite worked them out yet. I posted the snippet earlier from memory and missed a couple of steps. The latest (broken) code I'm working with looks like this:

Code:
int16 selection = XG2 == (x)(0x136032ED);
if (any(selection))
{
    x mask = Xnonce & 0xF;
    x temp = shuffle(select(Xnonce, 0, selection), mask);
    vstore16(temp, 0, output);
}

That "if" might be totally unnecessary, and I still don't quite understand how the output array works, but it might give you a better idea of what I was trying to do to avoid all those branches.
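For comparison, here is a lane-by-lane model of both versions in Python. It is only a sketch: it follows the OpenCL definitions of select (take the second argument where the condition's top bit is set), shuffle (a gather, result[i] = x[mask[i]]) and vstore16 (writes all 16 elements) as I understand them, and the input data (consecutive nonces, a single hit in lane 5) is made up for illustration.

```python
TARGET = 0x136032ED
LANES = 16


def branchy_store(xg2, xnonce, output):
    """The 16 if-statements: scatter only the matching nonces."""
    for g, n in zip(xg2, xnonce):
        if g == TARGET:
            output[n & 0xF] = n


def cl_select(a, b, c):
    """Model of OpenCL select(a, b, c): b[i] where the MSB of c[i] is set, else a[i]."""
    return [bi if ci & 0x80000000 else ai for ai, bi, ci in zip(a, b, c)]


def cl_shuffle(x, mask):
    """Model of OpenCL shuffle(x, mask): result[i] = x[mask[i]] (a gather)."""
    return [x[m & (LANES - 1)] for m in mask]


def vector_store(xg2, xnonce, output):
    """The posted snippet, transcribed lane by lane."""
    # A vector comparison sets all bits in each lane that compares equal.
    selection = [0xFFFFFFFF if g == TARGET else 0 for g in xg2]
    if any(s & 0x80000000 for s in selection):
        mask = [n & 0xF for n in xnonce]
        temp = cl_shuffle(cl_select(xnonce, [0] * LANES, selection), mask)
        output[:] = temp  # vstore16 overwrites all 16 slots


# Hypothetical data: consecutive nonces, lane 5 is the only hit.
xnonce = [0x1000 + i for i in range(LANES)]
xg2 = [TARGET if i == 5 else 0xDEADBEEF for i in range(LANES)]

out_branchy = [0] * LANES
branchy_store(xg2, xnonce, out_branchy)

out_vector = [0] * LANES
vector_store(xg2, xnonce, out_vector)

print([hex(v) for v in out_branchy])
print([hex(v) for v in out_vector])
```

With these inputs the branchy version stores only the winning nonce, while the vector version zeroes the winning lane and stores the other fifteen, because select(Xnonce, 0, selection) picks 0 exactly where selection is true. Swapping the first two arguments would fix the selection, but the store would still overwrite every slot.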

I'll go add official 8 and 16 wide support in a bit, should be useful on, say, AVX if you manually enable CPU mining in the code. SDK 2.6's cpu compiler apparently has gotten a lot better from what I've heard.

I'll be watching the repository then. It should almost definitely help with more modern CPUs and Larrabee/Intel MIC.
DiabloD3
Legendary

Offline

Activity: 1162
Merit: 1000

DiabloMiner author

 January 09, 2012, 11:24:37 PM


The output array is basically a massive hack to prevent multiple outputs from hitting each other, although the chances of getting multiple outputs are extremely low. The size of the array now is massive overkill, but it also seems to be a strangely optimum size for hardware.
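A tiny Python sketch of the slot indexing, for anyone following along. It assumes the 16-slot indexing from the code dump earlier in the thread (output[nonce & 0xF]) and that one vector covers consecutive nonces; both function names are made up for the example.

```python
def slot(nonce, slots=16):
    """Index used in the kernel dump: output[nonce & 0xF] for 16 slots."""
    return nonce & (slots - 1)


def lanes_collide(base_nonce, lanes=16, slots=16):
    """True if two lanes of one vector would write to the same output slot."""
    used = {slot(base_nonce + lane, slots) for lane in range(lanes)}
    return len(used) < lanes


print(lanes_collide(0x12345678))        # 16 consecutive nonces, 16 slots
print(lanes_collide(0, lanes=16, slots=8))  # fewer slots than lanes
print(slot(0x1005), slot(0x1015))       # two hits 16 apart share a slot
```

Under those assumptions the lanes of a single vector never overwrite each other; two hits can only collide if their nonces share the same low 4 bits.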

Now, what would give me the most benefit is some way of sorting the outputs in a single cycle so that the pair { nonce, H } could instantly give me the best nonce, and then only evaluate that. There seems to be no way to do this (and yes, I imply reverting that one bit of math so that H == 0 is literally done at the end again, which makes it much easier to sort on shit). The nonces themselves can't be sorted because they're completely random; they're meaningless values, essentially.

celcoid
Member

Offline

Activity: 280
Merit: 10

 January 10, 2012, 02:09:10 AM

So what do you think we can get out of this card, being 100% optimistic? How much is it limited by the current best software solution?
chromeguy
Newbie

Offline

Activity: 28
Merit: 0

 January 10, 2012, 03:25:38 AM

666+ MH/s
Some hardcore guys used liquid nitrogen cooling and overclocked it by 84%.
1onevvolf
Newbie

Offline

Activity: 43
Merit: 0

 January 10, 2012, 01:17:31 PM

Up to 670MH/s @ 1125/975MHz now with the driver AMD published yesterday! And I finally managed to find a wattmeter, so you can expect some measurements later today when I get back home from work.
EPiSKiNG
Legendary

Offline

Activity: 801
Merit: 1001

 January 10, 2012, 04:05:25 PM

Yay!!

1onevvolf
Newbie

Offline

Activity: 43
Merit: 0

 January 10, 2012, 10:04:53 PM

I've measured my system and these are the results:

                          Stock (925/1375MHz)    Overclocked (1125/975MHz)
Mining                  : 371 W @ 550MH/s        385 W @ 670MH/s
Idle                    : 118 W                  118 W
Difference (gfx card W) : 253 W                  267 W
MH/J (system)           : 1.48                   1.74
MH/J (gfx card only)    : 2.17                   2.51
MH/$ (gfx card only)    : 1.00                   1.22

(MH/$ estimated using lowest listed price for HD 7970 on amazon.com today)
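For anyone who wants to redo the arithmetic, the derived rows follow from three measurements per column: wall power while mining, wall power at idle, and hash rate. A minimal sketch in Python; the $550 price is an assumption, inferred from 550 MH/s working out to 1.00 MH/$, and the function name is only for illustration.

```python
def efficiency(mhs, wall_w, idle_w, price_usd):
    """Derive the table's efficiency rows from the raw measurements."""
    card_w = wall_w - idle_w  # power attributed to the graphics card
    return {
        "card_w": card_w,
        "mhj_system": round(mhs / wall_w, 2),   # MH/s per W is MH/J
        "mhj_card": round(mhs / card_w, 2),
        "mh_per_usd": round(mhs / price_usd, 2),
    }


PRICE_USD = 550  # assumption: implied by 550 MH/s giving 1.00 MH/$

stock = efficiency(550, 371, 118, PRICE_USD)
overclocked = efficiency(670, 385, 118, PRICE_USD)

print(stock)
print(overclocked)
```

Note that the card-only figure subtracts idle system draw, so it also charges the card for any extra CPU and PSU losses under load.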
EPiSKiNG
Legendary

Offline

Activity: 801
Merit: 1001

 January 10, 2012, 10:28:49 PM

Excellent findings!  Thank you for your candor!

-EP

DeathAndTaxes
Donator
Legendary

Offline

Activity: 1218
Merit: 1008

Gerald Davis

 January 10, 2012, 10:30:04 PM

Thanks for putting that all together.  For lack of a better term.... brutally bad.

Significant reduction in MH/W compared to 5000 series.

EPiSKiNG
Legendary

Offline

Activity: 801
Merit: 1001

 January 10, 2012, 10:32:33 PM

anyone care to update https://en.bitcoin.it/wiki/Mining_hardware_comparison  with the new findings??

-EP

DiabloD3
Legendary

Offline

Activity: 1162
Merit: 1000

DiabloMiner author

 January 10, 2012, 11:04:52 PM

Quote from: DeathAndTaxes
Thanks for putting that all together.  For lack of a better term.... brutally bad.

Significant reduction in MH/W compared to 5000 series.

Not entirely. Remember, this card will need significant optimizations, and don't compare apples to oranges against 58xx if you're not using the same SDK. Nothing is going to beat 58xx on SDK 2.1, and you shouldn't expect anything that glorious ever again. It's a classic. That said, with SDK 2.5 on 58xx you lose about 4-5%, give or take; dunno about 2.6. I still haven't quite figured out how to best fix that.

1862
Newbie

Offline

Activity: 65
Merit: 0

 January 10, 2012, 11:11:19 PM

I can't wait till the 7990; that is going to be impressive but expensive. I might have missed this, but what is the heat like when hashing overclocked, and at what fan speed?

Overclocked @ 1125/975MHz with automatic fan speed, I'm getting temperatures hovering around 81-83C, and the fan runs at 47-49% speed. You can see some screencaps on one of the earlier pages. But since I prefer lower temperatures and am worried about VRM and memory temps not yet being reported by GPU-Z, I usually run it at 60% fan speed and get temps around 72C. The blower fan at 60% speed is quite loud (it's a reference design from Sapphire).

At 100% fan speed, the overclocked card gets below 60C while mining, but you can hear it from outside of the house at this point, so as lovely as these temps are, this is not an option for me, as it is also my gaming and work PC.

Yeah, so they still have not fixed that damn reference fan design. Aftermarket coolers FTW!

Damn ATI and their crap loud fan designs

Could be worse, I have a GTX460 that makes me want to tear my hair out
project10
Newbie

Offline

Activity: 8
Merit: 0

 January 10, 2012, 11:22:05 PM

Thanks for the numbers, I've got one on the way.
DiabloD3
Legendary

Offline

Activity: 1162
Merit: 1000

DiabloMiner author

 January 10, 2012, 11:24:10 PM

At 100% fan speed, the overclocked card gets below 60C while mining, but you can hear it from outside of the house at this point, so as lovely as these temps are, this is not an option for me, as it is also my gaming and work PC.

As a reminder, 100% fan speed is a good way to kill the fan; they were never meant to be run that high.

Don't go above 85%.

1onevvolf
Newbie

Offline

Activity: 43
Merit: 0

 January 11, 2012, 12:36:56 AM

I've downloaded Sapphire's TriXX software and pushed the clocks even further, even managing to get the memory underclocked down to 150MHz:

                          1150/150/1.17V  1175/150/1.17V  1200/150/1.17V  1225/150/1.175V  1250/150/1.2V
Idle                    : 118 W           118 W           118 W           118 W            118 W
Mining                  : 392 W           400 W           408 W           415 W            441 W
Difference (gfx card W) : 274 W           282 W           290 W           297 W            323 W
MH/s                    : 675MH/s         690MH/s         705MH/s         716MH/s          733MH/s
MH/J (system)           : 1.72            1.73            1.73            1.73             1.66
MH/J (gfx card only)    : 2.46            2.45            2.43            2.41             2.27
MH/$ (gfx card only)    : 1.23            1.25            1.28            1.30             1.33

Power draw reaches a tipping point around 1250MHz where the core voltage needs to start getting tweaked quite a bit for stable overclocks. I have a feeling that with more exotic cooling or insane fan speeds (I stuck with 60% which as I stated earlier is already quite obnoxious) that this card is sure to go much higher. The tool didn't let me lower the voltage either, which might be an interesting thing to do to increase efficiency.

1125MHz still has the best efficiency, with 2.51 MH/J (gfx card only).
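To put a number on that tipping point, here is a small sketch in Python that works out how many extra MH/s each additional watt buys at every step up in core clock. The figures are copied from the table above; the function name is just something made up for this example.

```python
# (core MHz, wall watts while mining, MH/s), copied from the table above
RUNS = [
    (1150, 392, 675),
    (1175, 400, 690),
    (1200, 408, 705),
    (1225, 415, 716),
    (1250, 441, 733),
]


def marginal_mh_per_joule(runs):
    """Extra MH/s gained per extra watt for each step up in core clock."""
    steps = []
    for (_, w0, h0), (c1, w1, h1) in zip(runs, runs[1:]):
        steps.append((c1, round((h1 - h0) / (w1 - w0), 2)))
    return steps


steps = marginal_mh_per_joule(RUNS)
for clock, gain in steps:
    print(f"up to {clock} MHz: {gain} MH/J on the added power")
```

The steps up to 1225MHz each return more than 1.5 MH/J on the added power, while the last step to 1250MHz (the one that needed 1.2V) returns well under 1 MH/J.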
A1BITCOINPOOL
Newbie

Offline

Activity: 56
Merit: 0

 January 11, 2012, 12:44:10 AM

For the price I would rather get a 5870. I'm able to get 440 MH/s out of one of them, and it's more than half the price.