d3m0n1q_733rz
|
|
February 06, 2012, 04:59:45 AM |
|
Just so you know, ATI cards are capable of handling vectors up to 16 wide, as far as I'm aware. I'm not going to try this right now, but it should cut down on the amount of work that has to be done. Higher-end cards will, of course, see better results than lower-end ones. I don't know what the physical compute size is for the data, but the hardware will accept int16, which should be best for dedicated rigs as long as the worksize is dropped to about half of the hardware's limit, from what I see here.
Quote from: Diapolo
I could implement uint16, should be pretty straightforward, but massive vectorisation is really something GCN does not like currently. Dia

http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/3

I think it might be because the full 16 vectors are loaded and unloaded to make room for anything else that needs to be computed. In theory, 16 vectors at once is the best approach, but that only applies if we're doing math on just those 16 vectors, as that's the maximum the ALUs can hold. In other words, the moment something else needs to be loaded, the entire 512-bit integer has to be pulled from the ALUs into the cache, the new data loaded, computed, and unloaded, and then the 512-bit integers reloaded. But GCN is supposedly a true 16-vector design, so I think the problem is the overhead created by all that loading and unloading. With the 8 vectors, did you try a worksize of 64 to see if it was any faster?

PS: Bad news, there's a new Phoenix 2 miner. I've suggested they make changes to the phatk2 kernel like the ones you've made for your GCN kernel here, such as adding the GOFFSET option and increasing vector sizes.
|
Funroll_Loops, the theoretically quicker breakfast cereal! Check out http://www.facebook.com/JupiterICT for all of your computing needs. If you need it, we can get it. We have solutions for your computing conundrums. BTC accepted! 12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq
|
|
|
d3m0n1q_733rz
|
|
February 06, 2012, 07:34:18 AM Last edit: February 06, 2012, 09:02:30 AM by d3m0n1q_733rz |
|
Anyhow, going back to what I was saying, Dia: I think the best kernel design for GCN is one that computes four 512-bit integers. Since it can compute one in 4 cycles or 4 in 4 cycles, it seems best to try to compute 4 sets of 16 vectors to use the ALUs to their fullest extent. Alternatively, you could compute 3 sets and leave the last SIMD for the other work the kernel requires, such as the nonce and the like. So multi-threading comes into play with the GCN processors. The problem is that these aren't multi-GPUs, they're multi-SIMD GPUs, which makes coding a little more tricky. I might be a little over-zealous in thinking these are capable of handling four times the amount of mining at one time, but it seems like the approach to take. The biggest suggestion I could make, though, is to drop the worksize down to allow for the increased vectors. You should see some improvement with VECTORS8, but I can't promise it.
|
Funroll_Loops, the theoretically quicker breakfast cereal! Check out http://www.facebook.com/JupiterICT for all of your computing needs. If you need it, we can get it. We have solutions for your computing conundrums. BTC accepted! 12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq
|
|
|
gat3way
|
|
February 06, 2012, 10:38:18 AM Last edit: February 06, 2012, 10:48:27 AM by gat3way |
|
There is no "native 16-component vector support" in any AMD GPU hardware, including GCN. OpenCL vectors are just a software abstraction that does not map directly onto hardware. Furthermore, the hardware is not SIMD (GCN's vector ALU units are more like SIMD, but even they are _not_ 16-wide). It would be rather simple and convenient if vector operations mapped directly to hardware capabilities, but that's not the case. You could, for example, imagine the VLIW4 or VLIW5 architecture operating as a 4-wide or 5-wide SIMD unit, and that sounds pretty logical, but that does not happen in reality.
To emulate 16-component vectors, VLIW bundles are generated in a way that 16 ALU operations are being performed rather than say 4. Which means that if one or two VLIW bundles were generated for 4-wide vector ALU operation, 4 or more bundles would be generated for a 16-wide vector ALU operation. The only benefit of doing this is tighter ALUPacking which is not very relevant on 6xxx. In most cases though, the difference in ALUPacking between 4-component vectors and wider ones is negligible if your code is written so that needless dependencies are eliminated.
Unfortunately though, wider vectors mean more GPRs wasted, and more GPRs wasted mean fewer wavefronts per CU. So in most cases, wider vectors mean slower kernels due to lower occupancy. There is a nice table in the AMD APP SDK programming guide concerning the correlation of GPRs used to wavefronts/CU.
There are some cases where uint16 might in fact improve performance - like simple kernels that execute fast, where time is wasted on kernel scheduling and data transfers. In that case, uint16 means more work per kernel invocation, and the overall effect is positive when you weigh it against the increased GPR usage. Bitcoin kernels, though, are not such a case.
|
|
|
|
d3m0n1q_733rz
|
|
February 06, 2012, 11:09:05 AM Last edit: February 06, 2012, 11:28:50 AM by d3m0n1q_733rz |
|
Quote from: gat3way on February 06, 2012, 10:38:18 AM
Alright, but when it comes to 8 vectors, you can't argue with results. I've posted the table of gains with the VLIW5 hardware I use. And please read the papers on GCN again (assuming you read them once), as it's clearly stated that, "Not to be confused with the SIMD on Cayman (which is a collection of SPs), the SIMD on GCN is a true 16-wide vector SIMD. A single instruction and up to 16 data elements are fed to a vector SIMD to be processed over a single clock cycle. As with Cayman, AMD's wavefronts are 64 instructions meaning it takes 4 cycles to actually complete a single instruction for an entire wavefront. This vector unit is combined with a 64KB register file and that composes a single SIMD in GCN." Now, as I was saying, since the SIMDs are 16-wide and there are four of them, each SIMD could be loaded with 16 vectors, which would let the calculations run on all of them without wasting any clock cycles. Four sets of 16 vectors at once sounds pretty good to me. Cayman takes 4 clock cycles because SPs are used; GCN handles them in one. You do the math. Now, I don't know why Dia's been getting lower hash results with 8 vectors despite having more ALUs to handle them. But I have an HD5450, and I get the highest hashing rate using 8 vectors and a worksize of 64; you can find my results on the previous page. Oddly enough, that's on VLIW5, which isn't 8-wide, while 8-wide would only use half the ALUs of a single SIMD on GCN. So what gives? BTW, I'm talking about the 79xx series.
|
|
|
|
gat3way
|
|
February 06, 2012, 09:48:18 PM |
|
I'm telling you again, you've gotten that wrong. The vector ALU unit on GCN is not meant to map 1:1 with OpenCL's vectors. The GCN architecture is scalar in nature. The purpose of the vector ALU units is to handle ALU operations performed per work-item, rather than those handled on a per-workgroup basis. Vector ALU operations take 4 cycles to execute, compared to 1 cycle on the scalar unit. There might be an advantage to vectorization in some cases, but that's not because the vector unit behaves as a 16-wide SIMD unit (which is wrong, btw). The vector unit "appears" to operate as a SIMD one, but that comes at the price of instruction latency. There is now a section on the GCN architecture in the official APP SDK documentation: http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf Along with everything else, it is clearly stated there: "Notes: Vectorization is no longer needed, nor desirable. If your code used to combine work-items in order to get better VLIW use, this is no longer required."
Anyway, this can be easily demonstrated. Here is a very simple OpenCL kernel that shifts a kernel argument and writes it into an output buffer. This is with uint16 vectors:

__kernel void test(uint16 in, __global uint16 *dest)
{
    dest[get_global_id(0)] = in >> 2;
}
Here is the ISA dump:

s_buffer_load_dword s0, s[8:11], 0x04 // 00000000: C2000904
s_buffer_load_dword s1, s[8:11], 0x18 // 00000004: C2008918
s_buffer_load_dwordx4 s[8:11], s[12:15], 0x00 // 00000008: C2840D00
s_buffer_load_dwordx4 s[20:23], s[12:15], 0x04 // 0000000C: C28A0D04
s_buffer_load_dwordx4 s[24:27], s[12:15], 0x08 // 00000010: C28C0D08
s_buffer_load_dwordx4 s[28:31], s[12:15], 0x0c // 00000014: C28E0D0C
s_buffer_load_dword s2, s[12:15], 0x10 // 00000018: C2010D10
s_waitcnt lgkmcnt(0) // 0000001C: BF8C007F
s_mul_i32 s0, s16, s0 // 00000020: 93000010
s_add_i32 s0, s0, s1 // 00000024: 81000100
v_add_i32 v0, vcc, s0, v0 // 00000028: 4A000000
s_lshr_b32 s0, s8, 2 // 0000002C: 90008208
s_lshr_b32 s1, s9, 2 // 00000030: 90018209
s_lshr_b32 s3, s10, 2 // 00000034: 9003820A
s_lshr_b32 s8, s11, 2 // 00000038: 9008820B
v_lshlrev_b32 v0, 6, v0 // 0000003C: 34000086
v_add_i32 v0, vcc, s2, v0 // 00000040: 4A000002
v_mov_b32 v1, s0 // 00000044: 7E020200
v_mov_b32 v2, s1 // 00000048: 7E040201
v_mov_b32 v3, s3 // 0000004C: 7E060203
v_mov_b32 v4, s8 // 00000050: 7E080208
tbuffer_store_format_xyzw v[1:4], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 00000054: EBF71000 80010100
s_lshr_b32 s0, s28, 2 // 0000005C: 9000821C
s_lshr_b32 s1, s29, 2 // 00000060: 9001821D
s_lshr_b32 s2, s30, 2 // 00000064: 9002821E
s_lshr_b32 s3, s31, 2 // 00000068: 9003821F
s_waitcnt expcnt(0) // 0000006C: BF8C1F0F
v_mov_b32 v1, s0 // 00000070: 7E020200
v_mov_b32 v2, s1 // 00000074: 7E040201
v_mov_b32 v3, s2 // 00000078: 7E060202
v_mov_b32 v4, s3 // 0000007C: 7E080203
tbuffer_store_format_xyzw v[1:4], v0, s[4:7], 0 offen offset:48 format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 00000080: EBF71030 80010100
s_lshr_b32 s0, s24, 2 // 00000088: 90008218
s_lshr_b32 s1, s25, 2 // 0000008C: 90018219
s_lshr_b32 s2, s26, 2 // 00000090: 9002821A
s_lshr_b32 s3, s27, 2 // 00000094: 9003821B
s_waitcnt expcnt(0) // 00000098: BF8C1F0F
v_mov_b32 v1, s0 // 0000009C: 7E020200
v_mov_b32 v2, s1 // 000000A0: 7E040201
v_mov_b32 v3, s2 // 000000A4: 7E060202
v_mov_b32 v4, s3 // 000000A8: 7E080203
tbuffer_store_format_xyzw v[1:4], v0, s[4:7], 0 offen offset:32 format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 000000AC: EBF71020 80010100
s_lshr_b32 s0, s20, 2 // 000000B4: 90008214
s_lshr_b32 s1, s21, 2 // 000000B8: 90018215
s_lshr_b32 s2, s22, 2 // 000000BC: 90028216
s_lshr_b32 s3, s23, 2 // 000000C0: 90038217
s_waitcnt expcnt(0) // 000000C4: BF8C1F0F
v_mov_b32 v1, s0 // 000000C8: 7E020200
v_mov_b32 v2, s1 // 000000CC: 7E040201
v_mov_b32 v3, s2 // 000000D0: 7E060202
v_mov_b32 v4, s3 // 000000D4: 7E080203
tbuffer_store_format_xyzw v[1:4], v0, s[4:7], 0 offen offset:16 format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 000000D8: EBF71010 80010100
s_endpgm // 000000E0: BF810000
end
; ----------------- CS Data ------------------------
codeLenInByte = 228 ;Bytes
userElementCount = 3;
userElements[0] = IMM_UAV, 10, s[4:7]
userElements[1] = IMM_CONST_BUFFER, 0, s[8:11]
userElements[2] = IMM_CONST_BUFFER, 1, s[12:15]
extUserElementCount = 0;
NumVgprs = 5;
NumSgprs = 32;
Now here is the uint4 version:

__kernel void test(uint4 in, __global uint4 *dest)
{
    dest[get_global_id(0)] = in >> 2;
}
s_buffer_load_dword s0, s[8:11], 0x04 // 00000000: C2000904
s_buffer_load_dword s1, s[8:11], 0x18 // 00000004: C2008918
s_buffer_load_dwordx4 s[8:11], s[12:15], 0x00 // 00000008: C2840D00
s_buffer_load_dword s2, s[12:15], 0x04 // 0000000C: C2010D04
s_waitcnt lgkmcnt(0) // 00000010: BF8C007F
s_mul_i32 s0, s16, s0 // 00000014: 93000010
s_add_i32 s0, s0, s1 // 00000018: 81000100
v_add_i32 v0, vcc, s0, v0 // 0000001C: 4A000000
s_lshr_b32 s0, s8, 2 // 00000020: 90008208
s_lshr_b32 s1, s9, 2 // 00000024: 90018209
s_lshr_b32 s3, s10, 2 // 00000028: 9003820A
s_lshr_b32 s8, s11, 2 // 0000002C: 9008820B
v_lshlrev_b32 v0, 4, v0 // 00000030: 34000084
v_add_i32 v0, vcc, s2, v0 // 00000034: 4A000002
v_mov_b32 v1, s0 // 00000038: 7E020200
v_mov_b32 v2, s1 // 0000003C: 7E040201
v_mov_b32 v3, s3 // 00000040: 7E060203
v_mov_b32 v4, s8 // 00000044: 7E080208
tbuffer_store_format_xyzw v[1:4], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 00000048: EBF71000 80010100
s_endpgm // 00000050: BF810000
end
; ----------------- CS Data ------------------------
codeLenInByte = 84 ;Bytes
userElementCount = 3;
userElements[0] = IMM_UAV, 10, s[4:7]
userElements[1] = IMM_CONST_BUFFER, 0, s[8:11]
userElements[2] = IMM_CONST_BUFFER, 1, s[12:15]
extUserElementCount = 0;
NumVgprs = 5;
NumSgprs = 18;
As you can see, the IL->ISA backend did not even bother to map the vector operations to the vector unit; it used the scalar unit exclusively. The first version does 16 scalar shifts and wastes 32 SGPRs; the second does 4 scalar shifts and wastes 18 SGPRs. Now, before you ask "why is it behaving like that", there are several reasons for this. Once again, stop thinking about OpenCL as something that should map 1:1 to hardware. OpenCL is a high-level API. Even with IL, you don't have that control. You cannot directly influence how the backend is going to map onto the hardware. As for your improved results, I would advise you to have a look at the Python host code and/or the share rate as reported by the pool you are using. It is likely that the progress indicator is not reporting the correct speed for some reason (wrong NDRange calculation, wrong divisor, or something like that). I've done these experiments in the past with my own and others' bitcoin kernels, and in all cases the kernel performance dropped abruptly with vectorization above 4 (due to reduced occupancy).
|
|
|
|
d3m0n1q_733rz
|
|
February 06, 2012, 11:10:38 PM |
|
Quote from: gat3way on February 06, 2012, 09:48:18 PM
Which GPU are you using, specifically? It sounds like you're describing Cayman. And my results are accurate. It's more likely that your results dropped due to register spilling or high worksize settings.
|
|
|
|
gat3way
|
|
February 06, 2012, 11:19:03 PM |
|
You don't recognize Cayman ISA versus Tahiti ISA? Well, actually this is cross-compiled using the cl_amd_offline_devices extension. It is an AMD extension that lets you compile binary kernels for all hardware supported by the driver. The system I got the dumps and built the kernels on is a 6870. It does not matter though, as the generated binary is the same as the one you would get from clBuildProgram() on a 79xx.
|
|
|
|
d3m0n1q_733rz
|
|
February 06, 2012, 11:45:46 PM |
|
Quote from: gat3way on February 06, 2012, 11:19:03 PM

Actually, I was talking about Tahiti vs. Cayman, since Tahiti uses GCN and Cayman uses VLIW. The Tahiti GPU is different from the rest of the 7xxx series because it's based on the GCN architecture, which contains four full 16-wide vector units. From what you've told me, the best settings for the 79xx series cards would be 2 vectors and extremely high worksizes, but that data is based on the lower-end VLIW-based cards in the 7xxx series. Since it takes 8 cycles to complete a group on VLIW, vectorization seems like a good option there. The document you posted talks about the Southern Islands cards in general, but not the specifics of the 16-wide vectors. My argument is that VLIW can use more instructions, but GCN can use more vectors. The idea I'm trying to convey is to keep the vectors high and the instructions applied to them low. But I can't seem to avoid the darn spillover in the registers. >_<
|
|
|
|
gat3way
|
|
February 07, 2012, 12:04:24 AM |
|
Nope. There is no reason extremely high (what's extreme, btw? 256?) worksizes would work best. My bet would be that 64 works best. I might not be correct about this since I lack data, and I lack data because I am too lazy to profile current bitcoin kernels. The dumps I posted are from the Tahiti architecture; you can see there are no VLIW bundles and no clauses there, and the GCN ISA is clearly different from the VLIW one. Also, there is no reason why 2-component vectors would work best on Tahiti. Why do you think uint2 would work best? I don't think so. It might work well. It might not. Have you run that through SKA or sprofile? Once again (I am kinda tired of this, so I am not going to restate it anymore): there is no 1:1 mapping between OpenCL vectors and the 79xx's vector ALU units. The 79xx cannot "use more vectors", and "using more vectors" does not mean "use fewer instructions" on GCN hardware. Though frankly, I don't see why I am arguing about this. You are free to profile and benchmark. Again, do profile.
|
|
|
|
d3m0n1q_733rz
|
|
February 07, 2012, 12:36:44 AM |
|
Quote from: gat3way on February 07, 2012, 12:04:24 AM

>_< That's what I'm trying to do. I'm trying to let it use all 16 vectors at once instead of using smaller vectors to achieve the same thing. That way the instructions aren't repeated and the overhead is removed. But I've seen the best results while using 8 vectors and a worksize of 64 or 32.
|
|
|
|
|
-ck
Legendary
Offline
Activity: 4242
Merit: 1644
Ruu \o/
|
|
February 07, 2012, 10:51:36 PM |
|
So how are we going, diapolo? Is your kernel ready for me to port it to cgminer?
|
Developer/maintainer for cgminer, ckpool/ckproxy, and the -ck kernel 2% Fee Solo mining at solo.ckpool.org -ck
|
|
|
Diapolo (OP)
|
|
February 08, 2012, 06:53:01 AM |
|
Quote from: -ck on February 07, 2012, 10:51:36 PM
So how are we going diapolo? Is your kernel ready for me to port it to cgminer

Hey Con, it's ready for getting assimilated, the only thing is I really need your help with this. There are some differences in the supplied kernel variables and compiler arguments, which we should take a look at. Another difference is the output buffer, which is currently not compatible with the CGMINER code (but could be changed rather easily). I added another method of nonce calculation via the OpenCL 1.1 global offset, so a flag or function to detect OpenCL 1.1 would be needed in the CGMINER API. At the end of the week I should have a bit more time than I have now, but the planning phase can start as soon as you give me a go (and take me by the hand ^^). What would you suggest as a first step? Dia
|
|
|
|
-ck
|
|
February 08, 2012, 07:14:44 AM |
|
Check my git tree. There's a diakgcn branch now. Only the output side is different, but I can't get shares out of it yet. No support for new macros yet either.
|
|
|
|
-ck
|
|
February 08, 2012, 07:46:56 AM |
|
Quote from: -ck on February 08, 2012, 07:14:44 AM
Investigate the imported diakgcn. The only significant change is to the output code, but I get no shares yet... https://github.com/ckolivas/cgminer/blob/diakgcn/diakgcn120208.cl
|
|
|
|
Diapolo (OP)
|
|
February 08, 2012, 10:15:06 AM |
|
Quote from: -ck on February 08, 2012, 07:46:56 AM
Investigate the imported diakgcn. The only significant change is to the output code, but I get no shares yet... https://github.com/ckolivas/cgminer/blob/diakgcn/diakgcn120208.cl

Cool, I'll take a look at it ... Dia
|
|
|
|
d3m0n1q_733rz
|
|
February 08, 2012, 02:02:05 PM |
|
Hey, look into removing the base and uu ints by using #if GOFFSET=false, as uu and base are not used when GOFFSET is. They're just two more instructions and register entries.
|
Funroll_Loops, the theoretically quicker breakfast cereal! Check out http://www.facebook.com/JupiterICT for all of your computing needs. If you need it, we can get it. We have solutions for your computing conundrums. BTC accepted! 12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq
|
|
|
Diapolo (OP)
|
|
February 08, 2012, 02:06:02 PM |
|
Quote from: d3m0n1q_733rz on February 08, 2012, 02:02:05 PM
I had that in mind, because if GOFFSET is active, base is simply unused and can be removed from the code. Thanks for sharing your idea! Dia
|
|
|
|
d3m0n1q_733rz
|
|
February 08, 2012, 02:56:31 PM |
|
Quote from: Diapolo on February 08, 2012, 02:06:02 PM
I had that in mind, because if GOFFSET is active, base is simply unused and can be removed from the code. Thanks for sharing your idea! Dia

Just remember to use #ifndef GOFFSET with the constant that sets base to uu. I've also added GOFFSET to Phatk2! *Happy dance* It's tested and works.
|
|
|
|
Diapolo (OP)
|
|
February 08, 2012, 03:17:28 PM |
|
Quote from: d3m0n1q_733rz on February 08, 2012, 02:56:31 PM
Just remember to use #ifndef GOFFSET with the constant that sets base to uu. I've also added GOFFSET to Phatk2! *Happy dance* It's tested and works.

Nice work, did you get your version faster on your machine? Dia
|
|
|
|
|