Bitcoin Forum
Author Topic: GTX 680 Perf Estimations  (Read 8051 times)
mrb (OP)
Legendary
March 23, 2012, 09:57:39 AM
#1

Nvidia's GTX 680 has 1536 ALUs running at 1006-1058 MHz. There are a few unknowns because AFAIK there is still no public information on whether the Kepler architecture has a native instruction computing (a & b) | (~a & c) (aka BFI_INT on AMD GPUs, useful for implementing maj() and ch() in SHA-256), or a native integer rotate instruction. Nonetheless, it is easy to calculate the upper bound of mining performance because it is an embarrassingly parallel workload, and we know how many instructions are required in each of these scenarios:

  • 300-315 Mh/s assuming the Kepler microarchitecture still lacks BFI_INT and int rotate
  • 375-395 Mh/s if Nvidia added only int rotate
  • 450-475 Mh/s if Nvidia implemented both BFI_INT and int rotate natively

This would still be worse than AMD's comparable HD 7970 at stock clocks (554 Mh/s), but it is definitely a step in the right direction for Nvidia, which has finally recognized that having (many) more ALUs at a (slightly) lower clock beats fewer ALUs at a higher clock.
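
To make the instruction-count argument concrete, here is a minimal plain-C sketch (illustrative only, not from any actual miner; op counts assume one machine op per C operator and no compiler fusion):

Code:
/* Why BFI_INT and a native rotate matter for SHA-256 (illustrative C). */
#include <stdint.h>

/* ch() the portable way: AND, NOT, AND, OR = 4 logic ops. */
static uint32_t ch(uint32_t e, uint32_t f, uint32_t g) {
    return (e & f) | (~e & g);
}

/* maj() the portable way: 3 ANDs + 2 XORs = 5 logic ops. */
static uint32_t maj(uint32_t a, uint32_t b, uint32_t c) {
    return (a & b) ^ (a & c) ^ (b & c);
}

/* With a BFI_INT-style bit-select (e.g. OpenCL's bitselect(), where
   bitselect(x, y, s) = (s & y) | (~s & x)), ch() collapses to a single
   instruction, ch(e,f,g) == bitselect(g, f, e), and maj() to two:
   maj(a,b,c) == bitselect(a, b, a ^ c). */

/* Rotate-right the portable way: SHR, SHL, OR = 3 ops;
   a native rotate instruction does it in 1. */
static uint32_t rotr(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}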
Vbs
Hero Member
March 23, 2012, 10:36:55 AM
Last edit: March 23, 2012, 11:04:02 AM by Vbs
#2

Let's see...

A single GTX 580 with 512 shaders at default clocks (core 772MHz, shaders 1544MHz) gives around 140MH/s. It has 2 warp schedulers (each doing 2 instructions per core clock cycle) per SM.

A GTX 680 has 4 warp schedulers per SMX, so the theoretical performance should be around 140*(2*1536*1058)/(512*1544) = 576MH/s.
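For concreteness, here is that scaling arithmetic as a small C program; the issue_factor of 2 encodes the assumption that each GTX 680 shader retires twice the instructions per clock of a GTX 580 shader, which is the very point disputed below:

Code:
/* Vbs's scaling estimate (post #2) as a program.  issue_factor = 2
   is the assumption under dispute: that each GTX 680 shader retires
   twice the instructions per clock of a GTX 580 shader. */
#include <stdio.h>

int main(void) {
    double gtx580_mhs  = 140.0;                    /* measured baseline */
    double gtx680_alus = 1536, gtx680_clk = 1058;  /* boost clock, MHz  */
    double gtx580_alus = 512,  gtx580_clk = 1544;  /* shader clock, MHz */
    double issue_factor = 2.0;

    double est = gtx580_mhs * (issue_factor * gtx680_alus * gtx680_clk)
                            / (gtx580_alus * gtx580_clk);
    printf("GTX 680 estimate: %.0f MH/s\n", est);  /* ~576 */
    return 0;
}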

Quote
http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/2
In GF114 each SM contained 48 CUDA cores, with the 48 cores organized into 3 groups of 16. Joining those 3 groups of CUDA cores were 16 load/store units, 16 interpolation SFUs, 8 special function SFUs, and 8 texture units. Feeding all of those blocks was a pair of warp schedulers, each of which could issue up to 2 instructions per core clock cycle, for a total of up to 4 instructions in flight at any given time.

Quote
http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/2
Ultimately where the doubling of the size of the functional units allowed NVIDIA to drop the shader clock, it’s the second doubling of resources that makes GK104 much more powerful than GF114. The SMX is in nearly every significant way twice as powerful as a GF114 SM. At the end of the day NVIDIA already had a strong architecture in Fermi, so with Kepler they’ve gone and done the most logical thing to improve their performance: they’ve simply doubled Fermi.

Altogether the SMX now has 15 functional units that the warp schedulers can call on. Each of the 4 schedulers in turn can issue up to 2 instructions per clock if there’s ILP to be extracted from their respective warps, allowing the schedulers as a whole to issue instructions to up to 8 of the 15 functional units in any clock cycle.
mrb (OP)
Legendary
March 23, 2012, 11:01:56 AM
#3

Having 4 warp schedulers instead of 2 does not mean you can execute twice the number of instructions per clock. It just means you have 4 warps of threads taking turns being executed instead of 2.

In other words, divide your 576 Mh/s number by two: 288 Mh/s, which is within 4% of my first prediction (300-315 Mh/s).
Vbs
Hero Member
March 23, 2012, 11:11:29 AM
#4

Quote
http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/2
Because NVIDIA has essentially traded a fewer number of higher clocked units for a larger number of lower clocked units, NVIDIA had to go in and double the size of each functional unit inside their SM. Whereas a block of 16 CUDA cores would do when there was a shader clock, now a full 32 CUDA cores are necessary. The same is true for the load/store units and the special function units, all of which have been doubled in size in order to adjust for the lack of a shader clock. Consequently, this is why we can’t just immediately compare the CUDA core count of GK104 and GF114 and call GK104 4 times as powerful; half of that additional hardware is just to make up for the lack of a shader clock.

But of course NVIDIA didn’t stop there, as swapping out the shader clock for larger functional units only gives us the same throughput in the end. After doubling the size of the functional units in a SM, NVIDIA then doubled the number of functional units in each SM in order to grow the performance of the SM itself. 3 groups of CUDA cores became 6 groups of CUDA cores, 2 groups of load/store units, 16 texture units, etc. At the same time, with twice as many functional units NVIDIA also doubled the other execution resources, with 2 warp schedulers becoming 4 warp schedulers, and the register file being doubled from 32K entries to 64K entries.
mrb (OP)
Legendary
March 23, 2012, 11:23:22 AM
Last edit: March 23, 2012, 11:43:25 AM by mrb
#5

Yes. Two doublings. The number of cores quadrupled in an SM (now SMX): from 48 to 192. But I already take this into account in my calculation (192 cores per SMX * 8 SMXs = 1536 cores, which my numbers are based on). I think you are picking up isolated sentences without understanding the whole picture of how the GPU works.

Another way of seeing it, if I may grossly simplify: this AnandTech article says the GTX 680 roughly quadrupled the performance of a GTX 460 (which is approximately true: 4x the number of cores per SM/SMX, about the same number of SMs/SMXs (7 vs 8), and roughly the same shader clock (1006 vs 1350 MHz, within +/- 30%)). A GTX 460 mines at ~70 Mh/s, therefore a GTX 680 would mine at ~280 Mh/s (+/- 30%), which is again consistent with my first estimate of 300-315 Mh/s (assuming no BFI_INT or int rotate). This is a gross approximation, but you get the idea.
Vbs
Hero Member
March 23, 2012, 03:33:04 PM
Last edit: March 23, 2012, 03:56:45 PM by Vbs
#6

I completely understand your point. The thing is, Nvidia has specifically stated that Kepler doubles the instructions per clock of Fermi, so my calcs are based on that. Whether it holds true or not in real benchmarks, I dunno. Tongue

Quote
http://www.extremetech.com/extreme/123271-nvidias-gtx-680-emphasizes-efficiency-pours-on-the-speed
Nvidia’s ratio column is remarkably unhelpful; it only describes the increase between Fermi and Kepler rather than how resources are distributed relative to each other. GK104 packs four times the special function units (SFUs) and twice the texture units as GF110; the core is capable of processing twice as many instructions per clock (though it has three times as many cores to fill with those instructions).

A hypothetical GTX 580 SM with 96 shaders @ 772MHz should still have the same performance as a real SM with 48 shaders @ 1544MHz, because it could only process 4 instructions per clock cycle (2 warps). A GTX 680 SMX, however, is capable of doing 8 instructions per clock cycle (4 warps). It's the number of warp schedulers plus the shaders to fill that ultimately determines max performance.

The big question is: can 4 warp schedulers keep 192 shaders filled simultaneously, better than 2 warp schedulers can keep "96" filled? If so, you have twice as many water taps, with twice as many buckets to fill.
DeathAndTaxes
Donator Legendary
March 23, 2012, 03:57:23 PM
#7

I found this interesting:

[image: SiSoftware GPU crypto benchmark chart comparing the GTX 680, HD 7970, and others]

The hashing is SHA-256 and the encryption is AES-256.

What is interesting is the performance boost in AES-256, both relative to the 7970 and relative to the 580 (not shown, but roughly 6x faster).

SHA-256 is still dismal. My guess is Nvidia added some crypto instructions, but not ones useful for SHA-256.
mrb (OP)
Legendary
March 23, 2012, 06:49:59 PM
Last edit: March 23, 2012, 10:28:11 PM by mrb
#8

Vbs, if this interpretation were true, then the GTX 680 would be a 6180 single-precision GFLOPS chip, but it is actually only 3090 SP GFLOPS...

Perhaps talking about GFLOPS is an easier way to convince you... In all recent Nvidia and AMD GPUs, the number of 32-bit integer instructions executable per clock is linearly proportional to the number of single-precision floating-point instructions executable per clock. (This was not true with the old GT200, because Nvidia counted GFLOPS assuming mul+mad, or 3 fp operations/clock; anyway, I digress.) Per Nvidia's published figures:

GTX 460 = 907 SP GFLOPS
GTX 680 = 3090 SP GFLOPS
And 3090/907 = 3.4x (roughly the 4x ratio I mentioned earlier)

Therefore the GTX 680 would only mine 3.4x faster than the GTX 460 (again assuming no BFI_INT and no int rotate).

What Nvidia mean by "twice the number of instructions per clock" is that a whole SM/SMX executes twice as many instructions per core clock:  a GTX 460's SM executes 48 (# of shaders) * 2 (shader clock is 2x the core clock) = 96 instructions per core clock; a GTX 680's SMX executes 192 (# of shaders) * 1 (shader clock same as core clock) = 192 instructions per core clock.
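
A minimal C sketch of this bookkeeping (numbers taken from the post above, nothing measured):

Code:
/* Per-SM(X) instructions per core clock, plus the published SP GFLOPS
   ratio used as the cross-check in post #8. */
#include <stdio.h>

int main(void) {
    /* GTX 460 SM: 48 shaders at 2x the core clock.  */
    int sm_460  = 48 * 2;    /* 96 instructions per core clock */
    /* GTX 680 SMX: 192 shaders at 1x the core clock. */
    int smx_680 = 192 * 1;   /* 192: hence "twice per clock"   */

    /* Int32 throughput scales with published SP GFLOPS: */
    double ratio = 3090.0 / 907.0;   /* ~3.4x chip-wide */
    printf("per SM(X): %d -> %d instructions/core clock\n", sm_460, smx_680);
    printf("GTX 680 / GTX 460 throughput: %.1fx\n", ratio);
    return 0;
}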
mrb (OP)
Legendary
March 23, 2012, 07:22:54 PM
#9

Quote from: DeathAndTaxes
I found this interesting:

These benchmarks have no value for predicting Bitcoin performance. SiSoft publishes almost no information about what their GPU code is actually doing. For all we know, it is probably memory-bound instead of ALU-bound. And the huge discrepancy disfavoring AES-128 on AMD and SHA-256 on Nvidia suggests their code is poorly optimized in these respective cases and/or exposes flaws in the OpenCL compilers... There is no reason for SHA-256 hashing to be that slow on Nvidia (even if the GTX 680 has no BFI_INT and no int rotate, it should be at least about half as fast as the HD 7970). Conversely, there is no reason for AES-128 to be that slow on AMD.
bulanula
Hero Member
March 23, 2012, 11:48:52 PM
#10

Quote from: mrb on March 23, 2012, 09:57:39 AM
Nvidia's GTX 680 has 1536 ALUs running at 1006-1058 MHz. There are a few unknowns because AFAIK there is still no public information on whether the Kepler architecture has a native instruction computing (a & b) | (~a & c) (aka BFI_INT on AMD GPUs, useful for implementing maj() and ch() in SHA-256), or a native integer rotate instruction. Nonetheless, it is easy to calculate the upper bound of mining performance because it is an embarrassingly parallel workload, and we know how many instructions are required in each of these scenarios:

  • 300-315 Mh/s assuming the Kepler microarchitecture still lacks BFI_INT and int rotate
  • 375-395 Mh/s if Nvidia added only int rotate
  • 450-475 Mh/s if Nvidia implemented both BFI_INT and int rotate natively

This would still be worse than AMD's comparable HD 7970 at stock clocks (554 Mh/s), but it is definitely a step in the right direction for Nvidia, which has finally recognized that having (many) more ALUs at a (slightly) lower clock beats fewer ALUs at a higher clock.

If this gets 470 Mh/s, I still think the 7970 wins by a lot Sad

Really wanted some competition to force AMD to get better drivers for Linux.
gat3way
Sr. Member
March 24, 2012, 12:44:23 AM
#11

Quote from: Vbs
A hypothetical GTX 580 SM with 96 shaders @ 772MHz should still have the same performance as a real SM with 48 shaders @ 1544MHz, because it could only process 4 instructions per clock cycle (2 warps). A GTX 680 SMX, however, is capable of doing 8 instructions per clock cycle (4 warps). It's the number of warp schedulers plus the shaders to fill that ultimately determines max performance.


Nope. I think you don't quite understand how a GPU functions. The best approximation I can give you is hyperthreading with 4 register sets rather than 2. So yes, if there is a memory fetch operation, there would be 3 warps in flight rather than 1. So yes, Kepler would be good for memory-intensive tasks (which might explain the AES case, if they blew up the lookup tables and did not fit them in __local memory).

But no, there is no 4x instruction throughput. In that aspect, it's the same as Fermi. BTW, a warp does not "execute" for just one clock cycle; it executes for much longer, more than 20 clock cycles, and there is of course a pipeline. With the sm_1x architecture, the pipeline was fucked up and new instructions were fetched/retired once per 4 clocks. Fermi improved that to once per 2 clocks. From what I read in the PDF, Kepler does exactly the same as Fermi. Now the question is: sm_21 archs introduced something like out-of-order execution where similar instructions on different independent data could be "batched". This in turn led to vectorizing OpenCL code for sm_21, and to things like the GTX 460 being ~60% faster when uint2 vectors are used. I am really wondering how far they got with that in GK104 Smiley
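
A rough C stand-in for that effect (hypothetical code, not from any real miner): two independent dependency chains per thread, which is what uint2 vectorization hands the sm_21 scheduler to pair up.

Code:
#include <stdint.h>

static inline uint32_t rotr(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}

/* One SHA-256-style sigma0 step on two independent states at once.
   The two chains share no registers, so consecutive instructions have
   no data dependency and can be dual-issued on sm_21-style hardware. */
void sigma0_x2(uint32_t s[2], const uint32_t w[2]) {
    s[0] = rotr(s[0], 7) ^ rotr(s[0], 18) ^ (s[0] >> 3) ^ w[0];
    s[1] = rotr(s[1], 7) ^ rotr(s[1], 18) ^ (s[1] >> 3) ^ w[1];
}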

Vbs
Hero Member
March 24, 2012, 01:05:04 AM
#12

If I've been completely wrong here, I apologize. Smiley

Anyway, looking at the GTX 680 whitepaper again, page 7 ("comparing the chip level unit counts for GeForce GTX 580 (containing 16 SMs) to GeForce GTX 680 (containing 8 SMXs)"), the ratio of instructions/clock between the two is 2.0x. So, 140*2*1058/772 = 384 MH/s (they specify a 2.6x ratio with clock frequency, but it should be 1058MHz core, not 1006MHz).

I think I'm gonna wait for real results at this point now. Smiley
mrb (OP)
Legendary
March 25, 2012, 03:09:01 AM
Last edit: March 25, 2012, 03:50:31 AM by mrb
#13

Your math is almost right, except for this:

The Kepler SMX can execute 2x as many instructions as a 48-shader Fermi SM, but 3x as many as a 32-shader Fermi SM (which the GTX 580 is an example of). Therefore do not multiply by 2, but by 3.

You forgot to divide the end result by 2 because, as you pointed out, the number of SMXs in the GTX 680 is half the number of SMs in the GTX 580.

Finally, a GTX 580 can mine at 149 Mh/s with a properly fine-tuned miner, not 140 Mh/s. (An entry in the wiki claims this with rpcminer-cuda.exe set to use a high aggression parameter.)

Correcting these 3 mistakes gives a GTX 680 estimate of: 149*3*1058/772/2 = 306 Mh/s... which exactly matches my first estimate (300-315 Mh/s).
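
The corrected arithmetic, one factor per line (a trivial C check of the numbers above):

Code:
#include <stdio.h>

int main(void) {
    double est = 149.0            /* tuned GTX 580 baseline, Mh/s */
               * 3.0              /* SMX vs 32-shader GF110 SM    */
               * (1058.0 / 772.0) /* core clock ratio             */
               / 2.0;             /* 8 SMXs vs 16 SMs             */
    printf("GTX 680 estimate: %.0f Mh/s\n", est);  /* ~306 */
    return 0;
}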
Schleicher
Hero Member
March 25, 2012, 04:33:04 AM
Last edit: March 25, 2012, 05:33:15 AM by Schleicher
#14

Actually, according to the new CUDA 4.2 Programming Guide:
additions are 5.25 times faster
comparisons and shifts are 50% slower
logic operations are 4.25 times faster
multiplications are twice as fast
(per clock and multiprocessor, compared to GTX580)

mrb (OP)
Legendary
March 25, 2012, 05:26:37 AM
Last edit: March 26, 2012, 06:46:11 AM by mrb
#15

Schleicher: very funny Cheesy

Edit: I was privately contacted to explain my reaction to his post... Schleicher's post is obviously untrue. There are no such performance improvements in the GTX 680. I assume he was being humorous.
bulanula
Hero Member
March 25, 2012, 02:53:32 PM
#16

Strange that at this point in time we still don't have ANY real results...

Maybe someone found some magic inside and is getting 1 Ghash/s using this card Cheesy

Unlikely, but the lack of results really is strange, since this card owns the 7970 for gaming on all points.

Well, at least I hope this will bring 7990 prices down!
film2240
Legendary
March 25, 2012, 03:22:25 PM
#17

I think Nvidia is starting to get serious about improving hash power. I'd still prefer dual-GPU single-slot cards, as they take up less space in my case (yes, I know about the high power use and heat).

When will Nvidia release a dual-GPU single-slot card so I can see its hash power?

DeathAndTaxes
Donator Legendary
March 25, 2012, 03:26:35 PM
#18

Dual GPU Single slot?  Sounds like a horrible idea.
film2240
Legendary
March 25, 2012, 03:28:33 PM
#19

Quote from: DeathAndTaxes
Dual GPU Single slot?  Sounds like a horrible idea.

How's that a bad idea? I recall people badly wanting an HD 6990 (when it was at the top), so there is demand for a dual-GPU card. (I meant a single graphics board, not a single slot, sorry, as that would be impossible.)

Dyaheon
Member
March 25, 2012, 05:09:27 PM
#20

So you mean a GTX 690 or such? That's possibly coming in a few months, but then again, you could make an educated guess of its hashing rate from the 680's rate. Which we don't know yet, though...

The HD 7990 is probably coming pretty soon too, likely before the GTX 690.