21  Bitcoin / Mining software (miners) / Re: Phatk2 GOFFSET Mod on: February 09, 2012, 09:17:42 AM
I'd advise you to change this:

Code:
#ifdef VECTORS8
if (v.s0 == g.s0)
{
nonce = W[3].s0;
}
if (v.s1 == g.s1)
{
nonce = W[3].s1;
}
if (v.s2 == g.s2)
{
nonce = W[3].s2;
}
if (v.s3 == g.s3)
{
nonce = W[3].s3;
}
if (v.s4 == g.s4)
{
nonce = W[3].s4;
}
if (v.s5 == g.s5)
{
nonce = W[3].s5;
}
if (v.s6 == g.s6)
{
nonce = W[3].s6;
}
if (v.s7 == g.s7)
{
nonce = W[3].s7;
}
#elif defined VECTORS4

To this:

Code:
#ifdef VECTORS8
        uint8 eq=(v==g);
        if (any(eq))
        {
              eq = select(g,(uint8)0,eq);
              nonce = (eq.s0+eq.s1+eq.s2+eq.s3+eq.s4+eq.s5+eq.s6+eq.s7);
        }
#elif defined VECTORS4

A branchless version would be:

Code:
#ifdef VECTORS8
        uint8 eq;
        eq = select(g,(uint8)0,(v==g));
        nonce = (eq.s0+eq.s1+eq.s2+eq.s3+eq.s4+eq.s5+eq.s6+eq.s7);
#elif defined VECTORS4

It incurs a penalty of several extra ALU ops, which may or may not be acceptable compared to the one-branch version; this needs to be profiled.

Anyway, having 8 branches is a bad idea: even without divergence, it introduces at least 8 clauses, and clause latency is ~40 cycles on VLIW hardware. It should be better on GCN though.
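
For the record, the two builtins used above behave as follows per the OpenCL spec (a comment-only reminder, not part of the kernel):

Code:
// - vector comparisons such as (v == g) return -1 (all bits set) in each
//   component where the comparison is true, and 0 where it is false;
// - any(x) returns 1 if the most significant bit is set in any component of x;
// - select(a, b, c) picks, per component, b where the MSB of c is set and a
//   where it is not.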
22  Bitcoin / Mining software (miners) / Re: new DiaKGCN kernel for Phoenix miner (7970 / GCN) - 2012-02-04 on: February 07, 2012, 12:04:24 AM
Nope. There is no reason extremely high worksizes (what counts as extreme, by the way? 256?) would work best. My bet is that 64 would work best. I might not be correct about this, since I lack data, and I lack data because I am too lazy to profile current bitcoin kernels :)
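
A minimal sketch of what I mean by pinning the worksize for such an experiment (not from any particular miner; 'queue' and 'kernel' are assumed to be set up elsewhere, the global size must be a multiple of 64, and error checking is omitted):

Code:
#include <CL/cl.h>

static void enqueue_with_worksize_64(cl_command_queue queue, cl_kernel kernel,
                                     size_t global_work_items)
{
        size_t local = 64;   /* one wavefront on current AMD hardware */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_work_items, &local, 0, NULL, NULL);
}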

The dumps I posted are from the Tahiti architecture; you can see there are no VLIW bundles and no clauses there, and the GCN ISA is clearly different from the VLIW one :) Also, there is no reason why 2-component vectors would work best on Tahiti. Why do you think uint2 would work best? I don't think so. It might work well, it might not. Have you run that through SKA or sprofile?

Once again (I am kind of tired of this, so I am not going to restate it anymore): there is no 1:1 mapping between OpenCL vectors and the 79xx's vector ALU units. The 79xx cannot "use more vectors", because "using more vectors" does not mean "using fewer instructions" on GCN hardware. Though frankly, I don't see why I am arguing about this. You are free to profile and benchmark. Again, do profile :)

23  Bitcoin / Mining software (miners) / Re: new DiaKGCN kernel for Phoenix miner (7970 / GCN) - 2012-02-04 on: February 06, 2012, 11:19:03 PM
You can't tell Cayman ISA from Tahiti ISA? :)

Well, actually this is cross-compiled using the cl_amd_offline_devices extension. It is an AMD extension that lets you compile binary kernels for all hardware supported by the driver. The system I got the dumps and built the kernels on has a 6870. It does not matter though, as the generated binary is the same as the one you would get from clBuildProgram() on a 79xx.
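
In case anyone wants to reproduce the dumps, here is a minimal sketch of how the extension is typically enabled (assuming AMD's cl_ext.h provides CL_CONTEXT_OFFLINE_DEVICES_AMD; error checking omitted):

Code:
#include <CL/cl.h>
#include <CL/cl_ext.h>

/* Create a context that also exposes devices not physically present,
   then build and dump binaries for the target you care about. */
cl_context make_offline_context(cl_platform_id amd_platform)
{
        cl_context_properties props[] = {
                CL_CONTEXT_PLATFORM, (cl_context_properties)amd_platform,
                CL_CONTEXT_OFFLINE_DEVICES_AMD, (cl_context_properties)CL_TRUE,
                0
        };
        cl_int err;
        return clCreateContextFromType(props, CL_DEVICE_TYPE_ALL,
                                       NULL, NULL, &err);
}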
24  Bitcoin / Mining software (miners) / Re: new DiaKGCN kernel for Phoenix miner (7970 / GCN) - 2012-02-04 on: February 06, 2012, 09:48:18 PM
I'm telling you again, you've gotten that wrong. The vector ALU unit on GCN is not meant to map 1:1 to OpenCL's vectors. The GCN architecture is scalar in nature. The purpose of the vector ALU units is to handle ALU operations that are performed per work-item rather than on a per-workgroup basis. Vector ALU operations take 4 cycles to execute compared to 1 cycle on the scalar unit. There might be an advantage to vectorization in some cases, but that is not because the vector unit behaves as a 16-wide SIMD unit (which is wrong, by the way). The vector unit "appears" to operate as a SIMD one, but that comes at the price of the instruction latency.

There is now a section on GCN architecture on the official APP SDK documentation:

http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

Along with everything else, it is clearly stated there:

Quote
Notes –
•   Vectorization is no longer needed, nor desirable. If your code used to
    combine work-items in order to get better VLIW use, this is no longer
    required.


Anyway, this can be easily demonstrated. Here is a very simple OpenCL kernel that shifts a kernel argument and writes it into an output buffer. This is with uint16 vectors:


Code:
__kernel void test(uint16 in, __global uint16 *dest)
{
        dest[get_global_id(0)] = in >> 2;
}

Here is the ISA dump:

Code:
  s_buffer_load_dword  s0, s[8:11], 0x04                    // 00000000: C2000904
  s_buffer_load_dword  s1, s[8:11], 0x18                    // 00000004: C2008918
  s_buffer_load_dwordx4  s[8:11], s[12:15], 0x00            // 00000008: C2840D00
  s_buffer_load_dwordx4  s[20:23], s[12:15], 0x04           // 0000000C: C28A0D04
  s_buffer_load_dwordx4  s[24:27], s[12:15], 0x08           // 00000010: C28C0D08
  s_buffer_load_dwordx4  s[28:31], s[12:15], 0x0c           // 00000014: C28E0D0C
  s_buffer_load_dword  s2, s[12:15], 0x10                   // 00000018: C2010D10
  s_waitcnt     lgkmcnt(0)                                  // 0000001C: BF8C007F
  s_mul_i32     s0, s16, s0                                 // 00000020: 93000010
  s_add_i32     s0, s0, s1                                  // 00000024: 81000100
  v_add_i32     v0, vcc, s0, v0                             // 00000028: 4A000000
  s_lshr_b32    s0, s8, 2                                   // 0000002C: 90008208
  s_lshr_b32    s1, s9, 2                                   // 00000030: 90018209
  s_lshr_b32    s3, s10, 2                                  // 00000034: 9003820A
  s_lshr_b32    s8, s11, 2                                  // 00000038: 9008820B
  v_lshlrev_b32  v0, 6, v0                                  // 0000003C: 34000086
  v_add_i32     v0, vcc, s2, v0                             // 00000040: 4A000002
  v_mov_b32     v1, s0                                      // 00000044: 7E020200
  v_mov_b32     v2, s1                                      // 00000048: 7E040201
  v_mov_b32     v3, s3                                      // 0000004C: 7E060203
  v_mov_b32     v4, s8                                      // 00000050: 7E080208
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 00000054: EBF71000 80010100
  s_lshr_b32    s0, s28, 2                                  // 0000005C: 9000821C
  s_lshr_b32    s1, s29, 2                                  // 00000060: 9001821D
  s_lshr_b32    s2, s30, 2                                  // 00000064: 9002821E
  s_lshr_b32    s3, s31, 2                                  // 00000068: 9003821F
  s_waitcnt     expcnt(0)                                   // 0000006C: BF8C1F0F
  v_mov_b32     v1, s0                                      // 00000070: 7E020200
  v_mov_b32     v2, s1                                      // 00000074: 7E040201
  v_mov_b32     v3, s2                                      // 00000078: 7E060202
  v_mov_b32     v4, s3                                      // 0000007C: 7E080203
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen offset:48 format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 00000080: EBF71030 80010100
  s_lshr_b32    s0, s24, 2                                  // 00000088: 90008218
  s_lshr_b32    s1, s25, 2                                  // 0000008C: 90018219
  s_lshr_b32    s2, s26, 2                                  // 00000090: 9002821A
  s_lshr_b32    s3, s27, 2                                  // 00000094: 9003821B
  s_waitcnt     expcnt(0)                                   // 00000098: BF8C1F0F
  v_mov_b32     v1, s0                                      // 0000009C: 7E020200
  v_mov_b32     v2, s1                                      // 000000A0: 7E040201
  v_mov_b32     v3, s2                                      // 000000A4: 7E060202
  v_mov_b32     v4, s3                                      // 000000A8: 7E080203
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen offset:32 format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 000000AC: EBF71020 80010100
  s_lshr_b32    s0, s20, 2                                  // 000000B4: 90008214
  s_lshr_b32    s1, s21, 2                                  // 000000B8: 90018215
  s_lshr_b32    s2, s22, 2                                  // 000000BC: 90028216
  s_lshr_b32    s3, s23, 2                                  // 000000C0: 90038217
  s_waitcnt     expcnt(0)                                   // 000000C4: BF8C1F0F
  v_mov_b32     v1, s0                                      // 000000C8: 7E020200
  v_mov_b32     v2, s1                                      // 000000CC: 7E040201
  v_mov_b32     v3, s2                                      // 000000D0: 7E060202
  v_mov_b32     v4, s3                                      // 000000D4: 7E080203
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen offset:16 format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 000000D8: EBF71010 80010100
  s_endpgm                                                  // 000000E0: BF810000
end

; ----------------- CS Data ------------------------

codeLenInByte        = 228;Bytes

userElementCount     = 3;
;  userElements[0]    = IMM_UAV, 10, s[4:7]
;  userElements[1]    = IMM_CONST_BUFFER, 0, s[8:11]
;  userElements[2]    = IMM_CONST_BUFFER, 1, s[12:15]
extUserElementCount  = 0;
NumVgprs             = 5;
NumSgprs             = 32;



Now here is the uint4 version:


Code:
__kernel void test(uint4 in, __global uint4 *dest)
{
        dest[get_global_id(0)] = in >> 2;
}


Code:
  s_buffer_load_dword  s0, s[8:11], 0x04                    // 00000000: C2000904
  s_buffer_load_dword  s1, s[8:11], 0x18                    // 00000004: C2008918
  s_buffer_load_dwordx4  s[8:11], s[12:15], 0x00            // 00000008: C2840D00
  s_buffer_load_dword  s2, s[12:15], 0x04                   // 0000000C: C2010D04
  s_waitcnt     lgkmcnt(0)                                  // 00000010: BF8C007F
  s_mul_i32     s0, s16, s0                                 // 00000014: 93000010
  s_add_i32     s0, s0, s1                                  // 00000018: 81000100
  v_add_i32     v0, vcc, s0, v0                             // 0000001C: 4A000000
  s_lshr_b32    s0, s8, 2                                   // 00000020: 90008208
  s_lshr_b32    s1, s9, 2                                   // 00000024: 90018209
  s_lshr_b32    s3, s10, 2                                  // 00000028: 9003820A
  s_lshr_b32    s8, s11, 2                                  // 0000002C: 9008820B
  v_lshlrev_b32  v0, 4, v0                                  // 00000030: 34000084
  v_add_i32     v0, vcc, s2, v0                             // 00000034: 4A000002
  v_mov_b32     v1, s0                                      // 00000038: 7E020200
  v_mov_b32     v2, s1                                      // 0000003C: 7E040201
  v_mov_b32     v3, s3                                      // 00000040: 7E060203
  v_mov_b32     v4, s8                                      // 00000044: 7E080208
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 00000048: EBF71000 80010100
  s_endpgm                                                  // 00000050: BF810000
end

; ----------------- CS Data ------------------------

codeLenInByte        = 84;Bytes

userElementCount     = 3;
;  userElements[0]    = IMM_UAV, 10, s[4:7]
;  userElements[1]    = IMM_CONST_BUFFER, 0, s[8:11]
;  userElements[2]    = IMM_CONST_BUFFER, 1, s[12:15]
extUserElementCount  = 0;
NumVgprs             = 5;
NumSgprs             = 18;



As you can see, the IL->ISA backend did not even bother to map the vector operations to the vector unit; it used the scalar unit exclusively. The first version does 16 scalar shifts and wastes 32 SGPRs, the second one does 4 scalar shifts and wastes 18 SGPRs.

Now, before you say "why is it behaving like that": there are several reasons for this. Once again, stop thinking about OpenCL as something that should map 1:1 to hardware. OpenCL is a high-level API. Even with IL, you don't have that control. You cannot directly influence how the backend is going to map your code onto the hardware.

As for your improved results, I would advise you to have a look at the Python host code and/or the share rate as reported by the pool you are using. It is likely that the progress indicator is not reporting the correct speed for some reason (wrong NDRange calculation, wrong divisor or something like that). I've done those experiments in the past with my own and others' bitcoin kernels, and in all cases the kernel performance dropped abruptly with vectorization above 4 (due to reduced occupancy).
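
What I mean by a wrong divisor, as a sketch (the names are made up, every miner computes this differently - the point is that the displayed rate has to use the vector width the kernel was actually built with):

Code:
#include <stddef.h>
#include <stdio.h>

static double mhash_per_sec(size_t global_work_size, unsigned vec_width,
                            double elapsed_seconds)
{
        /* nonces actually tested per enqueue */
        double hashes = (double)global_work_size * vec_width;
        return hashes / (elapsed_seconds * 1e6);
}

int main(void)
{
        /* example numbers only */
        printf("%.2f MH/s\n", mhash_per_sec(1 << 20, 8, 0.05));
        return 0;
}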
25  Bitcoin / Mining software (miners) / Re: new DiaKGCN kernel for Phoenix miner (7970 / GCN) - 2012-02-04 on: February 06, 2012, 10:38:18 AM
There is no "native 16-component vectors support" in any AMD GPU hardware, including GCN. OpenCL vectors are just a software abstraction that does not map directly on hardware. Furthermore, hardware is not SIMD (GCN's vector ALU units are more like SIMD, but they are _not_ 16-wide nevertheless). It would be rather naive and easy if vector operations were directly mapped to hardware capabilities but it's not the case. You could for example imagine the VLIW4 or VLIW5 architecture operating as 4-wide or 5-wide SIMD unit and that sounds pretty logical, but that does not happen in reality.

To emulate 16-component vectors, VLIW bundles are generated so that 16 ALU operations are performed rather than, say, 4. That means if one or two VLIW bundles were generated for a 4-wide vector ALU operation, 4 or more bundles would be generated for a 16-wide one. The only benefit of doing this is tighter ALUPacking, which is not very relevant on 6xxx. In most cases, the difference in ALUPacking between 4-component vectors and wider ones is negligible if your code is written so that needless dependencies are eliminated.

Unfortunately, wider vectors mean more GPRs wasted, and more GPRs wasted means fewer wavefronts per CU. So in most cases, wider vectors mean slower kernels due to lower occupancy. There is a nice table in the AMD APP SDK programming guide on the correlation between GPRs used and wavefronts/CU.


There are some cases where uint16 might in fact improve performance - like simple kernels that execute fast, where time is wasted on kernel scheduling and data transfers. In that case using uint16 means more work per kernel invocation, and the overall effect is positive when you weigh it against the increased GPR usage. Bitcoin kernels, though, are not such a case.
26  Bitcoin / Mining software (miners) / Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2012-01-13 on: January 18, 2012, 11:58:36 AM
There is no documentation yet. Those are the strings carved out of libamdocl64.so. Additionally, I've tested most of them (excluding max3/min3 and the sad ones) and they work. For some reason you need to compile with -Dcl_amd_media_ops2, because the pragma alone does not enable it.
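
A minimal sketch of what I mean (the kernel and constants are just an illustration, not from the actual miner). The define goes in the clBuildProgram() options:

Code:
// build with: clBuildProgram(prog, 0, NULL, "-Dcl_amd_media_ops2", NULL, NULL);
#pragma OPENCL EXTENSION cl_amd_media_ops2 : enable

__kernel void bfe_demo(__global uint *dst, uint x)
{
        // amd_bfe(src, offset, width): extract 'width' bits of 'src'
        // starting at bit 'offset'
        dst[get_global_id(0)] = amd_bfe(x, 8u, 16u);
}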

For the full list see this thread:

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=157516&messid=1274705&parentid=1274660&FTVAR_FORUMVIEWTMP=Branch
27  Bitcoin / Mining software (miners) / Re: further improved phatk OpenCL Kernel (> 4% increase) for Phoenix - 2012-01-13 on: January 18, 2012, 10:48:13 AM
Hello,

Unfortunately, the cl_amd_media_ops2 extension has nothing to do with BFI_INT. There are amd_bfe() and amd_bfm() functions defined, but nothing that maps to bfi_int.

Can I have that pdf too please?
28  Bitcoin / Mining / Re: Want legit 7970 testing/benchmarking and tuning for cgminer and Diablominer? on: January 05, 2012, 11:52:11 PM
Well, good news: it looks like with SDK 2.6, bitselect() is mapped to the corresponding bfi instruction on GCN archs - but not on VLIW ones. That is nice. So the bfi problem is solved there.
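
For context, this is the place where it matters in a SHA-256 kernel - a sketch using the standard identities, not any particular miner's code. bitselect(a, b, c) computes, bitwise, (a & ~c) | (b & c), so with 2.6 these forms should end up as single v_bfi_b32 instructions on GCN:

Code:
// Ch(e,f,g)  = (e & f) ^ (~e & g)            == bitselect(g, f, e)
// Maj(a,b,c) = (a & b) ^ (a & c) ^ (b & c)   == bitselect(a, b, a ^ c)
#define Ch(e, f, g)  bitselect((g), (f), (e))
#define Maj(a, b, c) bitselect((a), (b), (a) ^ (c))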

29  Bitcoin / Mining / Re: Offline mining? on: January 04, 2012, 11:56:48 AM
Quote
Exactly.  Outside of Bitcoin nothing needs to perform trillions of hashes per second.

Password recovery and SL3 unlocking certainly do. Actually, I would not be surprised if, overall, more computing resources are thrown at SL3 than at bitcoin. It's more profitable overall.
30  Bitcoin / Mining / Re: Want legit 7970 testing/benchmarking and tuning for cgminer and Diablominer? on: January 02, 2012, 01:10:44 AM
Releasing a GCN-optimized miner would be a HUGE mistake.

Really, efforts should be monetized. Just stop it before it's too late - you are turning the bitcoin mining community into an even more fucked up version of the SL3 one for yet another generation of GPU chips. Be responsible :)
31  Bitcoin / Mining / Re: Want legit 7970 testing/benchmarking? 1DbeWKCxnVCt3sRaSAmZLoboqr8pVyFzP1 on: December 28, 2011, 09:44:59 PM
Yay, we have all been waiting for more than a year :) I constantly have problems with BFI patching on my kernels, and besides that, there are some rare cases where amd_bytealign would do a good job for me but gets clobbered because of that patching, grr!

But the truth is it did not appear in SDK 2.6. I am not acquainted with AMD's OpenCL on Windows, but on Linux they recently started shipping the OpenCL library as part of the Catalyst suite rather than as part of the SDK package. That means with each new driver you get new surprises, hopefully pleasant ones :)

However, neither 11.11 nor 11.12 maps bitselect to BFI_INT. Neither does the "preview" OpenCL 1.2 driver you can download from their site. It's annoying, to say the least :(

32  Bitcoin / Mining / Re: Want legit 7970 testing/benchmarking? 1DbeWKCxnVCt3sRaSAmZLoboqr8pVyFzP1 on: December 28, 2011, 09:19:10 PM
You are wrong about that. There is a BFI instruction in the GCN architecture - v_bfi_b32. Strangely, there is no "s_" equivalent on the scalar unit, but it is there. I have occasionally seen it in my Tahiti kernels, and you can also grep it from libaticaldd.so.

Now, bitselect() definitely does not map to v_bfi_b32, yet some patterns of the kind (a&b)|(~a&c), where part of the variables are constants, do produce bfi code.

I cannot patch the binary, of course, because the architecture is different and the opcodes are different as well. Trying to patch the binary the same way I do for VLIW kernels does nothing. I guess until the 79xx ISA reference comes out, using bfi on GCN hardware will not be possible - unless they finally map bitselect to bfi, of course.
33  Bitcoin / Mining / Re: Want legit 7970 testing/benchmarking? 1DbeWKCxnVCt3sRaSAmZLoboqr8pVyFzP1 on: December 28, 2011, 08:37:06 PM
What makes you think you won't need it?
34  Bitcoin / Mining / Re: Want legit 7970 testing/benchmarking? 1DbeWKCxnVCt3sRaSAmZLoboqr8pVyFzP1 on: December 28, 2011, 07:40:35 PM
It does not (I tried it personally) - bitselect is still not mapped to BFI_INT, although it is at last exposed at the IL level.
35  Bitcoin / Mining / Re: Want legit 7970 testing/benchmarking? 1DbeWKCxnVCt3sRaSAmZLoboqr8pVyFzP1 on: December 28, 2011, 04:06:07 PM
Nice, that would explain why the experimental results are just a bit better than a 6970's.

One more reason why optimized code for GCN should not be released in public :)
36  Bitcoin / Mining / Re: Want legit 7970 testing/benchmarking? 1DbeWKCxnVCt3sRaSAmZLoboqr8pVyFzP1 on: December 28, 2011, 12:54:29 PM
The first thing is that _NO_ miner would work on GCN without modifications. The reason is simple: the BFI_INT replacement routine. GCN is a completely different architecture, and so are the opcodes. So no miner would produce correct results (if it produces any results at all).

This needs to be changed.

Then, the GCN architecture puts an end to the need to vectorize. Any current kernel would be inefficient on GCN without a rewrite.

In fact, without a rewrite you are not getting a slower, unoptimized version of the miner on GCN hardware - you are not getting a working miner at all.
37  Bitcoin / Mining / Re: Want legit 7970 testing/benchmarking? 1DbeWKCxnVCt3sRaSAmZLoboqr8pVyFzP1 on: December 27, 2011, 11:05:20 PM
I think it's a good idea not to deliver kernels optimized for GCN at all. Since there appear to be no more than 7-8 people here who can code that, it won't be that hard :D
38  Bitcoin / Mining / Re: Want legit 7970 testing/benchmarking? 1DbeWKCxnVCt3sRaSAmZLoboqr8pVyFzP1 on: December 27, 2011, 10:51:52 AM
I think this is a good opportunity for everyone who codes miners. Do not publicly release kernels optimized for GCN. Time to monetize the effort.
39  Bitcoin / Mining software (miners) / Re: *Catalyst 12.1 Preview* Decreased performance, anyone else confirm? on: December 23, 2011, 04:44:36 PM
Not quite, Python is not among my strong suits. I may rewrite my miner though, just for the experiment. Anyway, I have more important projects right now.
40  Bitcoin / Mining software (miners) / Re: *Catalyst 12.1 Preview* Decreased performance, anyone else confirm? on: December 22, 2011, 10:58:56 PM
Nope, the preferred vector size does not always mean "best performance". Wider vectors mean more GPRs used, and the more GPRs you use, the fewer wavefronts you can schedule on a CU, so occupancy goes down. Also, on VLIW5 hardware uint4 vectors are not optimal: it can happen that there are not 5 independent instructions available to fill the whole VLIW bundle. That depends on your code.

For example, with hash cracking you might end up with uint8 being much better than uint4 for kernels like the MD5 or NTLM ones, as ALUPacking goes up and the number of GPRs used is 10-20 at most. On the other hand, more complex algorithms like SHA512 do much better with uint2 or even a scalar implementation, as the number of GPRs used greatly hampers occupancy. With memory-intensive kernels like DES ones (thank god bitcoin is not one), occupancy becomes even more important, as more concurrency means memory access latencies are more easily hidden.

Back to uint2: the problem with it is that it is even worse at utilizing all the slots in the VLIW bundle, so your ALUPacking just always suffers. Generally speaking though, with most bitcoin kernels it's a tradeoff worth making, as bad occupancy in that particular case hurts more than bad ALUPacking.

uint3 should provide a better balance between those, but it was broken in pre-2.6 APP SDK releases. Now that they have fixed it, I am curious about the results...


PS: as for why uint4 suddenly started performing better, it could be either that they improved scheduling or that they improved the backend compiler to pack instructions better with uint4 / worse with uint2. It could actually be both.
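
By the way, the "preferred vector size" in question is just what the runtime reports; a minimal query sketch, assuming 'dev' is a valid cl_device_id obtained elsewhere (error checking omitted):

Code:
#include <stdio.h>
#include <CL/cl.h>

static void print_preferred_int_width(cl_device_id dev)
{
        cl_uint w = 0;
        clGetDeviceInfo(dev, CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT,
                        sizeof(w), &w, NULL);
        printf("preferred uint vector width: %u\n", w);
}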