Bitcoin Forum
March 29, 2024, 10:24:37 AM
Author Topic: DiaKGCN kernel for CGMINER + Phoenix 2 (79XX / 78XX / 77XX / GCN) - 2012-05-25  (Read 27710 times)
d3m0n1q_733rz
Sr. Member
****
Offline

Activity: 378
Merit: 250



View Profile WWW
February 06, 2012, 04:59:45 AM
 #41

Just so you know, the ATI cards are capable of handling vectors up to 16 wide, as far as I'm aware.  I'm not going to try this right now, but it should cut down on the amount of work that has to be done.  Higher-end cards will, of course, see better results than lower-end ones.  I don't know what the physical compute size is for the data, but they handle uint16, which should be best for dedicated rigs as long as the worksize is dropped to about half of the hardware's limit, from what I see here.

I could implement uint16, and it should be pretty straightforward, but massive vectorisation is really something GCN currently does not like.

Dia
http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/3
I think it might be because the full 16 vectors are loaded and unloaded to make room for anything else that needs to be computed.   Undecided
In theory, 16 vectors at once is the best approach, but only if we're doing math on those 16 vectors alone, since that's the maximum the ALUs can hold.
In other words, the moment something else needs to be loaded, the entire 512-bit integer has to be pulled out of the ALUs into the cache, the new data loaded, computed, and unloaded, and then the 512-bit integers reloaded.  But GCN is supposedly a true 16-wide vector design, so I think the problem is the overhead created by all that loading and unloading.  With the 8 vectors, did you try a worksize of 64 to see if it was any faster?

PS: Bad news, there's a new Phoenix 2 miner out.  I've suggested they make changes to the phatk2 kernel like the ones you've made for your GCN kernel here, such as adding the GOFFSET option and increasing vector sizes.

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs.  If you need it, we can get it.  We have solutions for your computing conundrums.  BTC accepted!  12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq
d3m0n1q_733rz
Sr. Member
****
Offline

Activity: 378
Merit: 250



View Profile WWW
February 06, 2012, 07:34:18 AM
Last edit: February 06, 2012, 09:02:30 AM by d3m0n1q_733rz
 #42

Anyhow, going back to what I was saying, Dia: I think the best kernel design for GCN is one that computes four 512-bit integers.  Since it can compute one in 4 cycles or 4 in 4 cycles, it seems best to try to compute 4 sets of 16 vectors to use the ALUs to their fullest extent.  Alternatively, you could compute 3 sets and leave the last SIMD for other work required by the kernel, such as the nonce handling.  So multi-threading comes into play with the GCN processors.   Cool
The problem is that these aren't multi-GPUs; they're multi-SIMD GPUs, which makes coding a little trickier.
I might be a little over-zealous to think these are capable of handling four times the amount of mining at one time, but it seems like the approach to take.
The biggest suggestion I can make, though, is to drop the worksize down to allow for the increased vectors.  You should see some improvement with VECTORS8, but I can't promise it.

gat3way
Sr. Member
****
Offline

Activity: 256
Merit: 250


View Profile
February 06, 2012, 10:38:18 AM
Last edit: February 06, 2012, 10:48:27 AM by gat3way
 #43

There is no "native 16-component vector support" in any AMD GPU hardware, including GCN. OpenCL vectors are just a software abstraction that does not map directly onto hardware. Furthermore, the hardware is not SIMD in that sense (GCN's vector ALU units are more like SIMD, but they are _not_ 16-wide in the way you mean). It would be rather convenient if vector operations mapped directly to hardware capabilities, but that's not the case. You could, for example, imagine the VLIW4 or VLIW5 architecture operating as a 4-wide or 5-wide SIMD unit, and that sounds pretty logical, but it does not happen in reality.

To emulate 16-component vectors, VLIW bundles are generated so that 16 ALU operations are performed rather than, say, 4. That means that where one or two VLIW bundles would be generated for a 4-wide vector ALU operation, 4 or more bundles are generated for a 16-wide one. The only benefit of doing this is tighter ALUPacking, which is not very relevant on 6xxx. In most cases, the difference in ALUPacking between 4-component vectors and wider ones is negligible anyway, provided your code is written so that needless dependencies are eliminated.

Unfortunately though, wider vectors mean more GPRs wasted, and more GPRs wasted mean fewer wavefronts per CU. So in most cases, wider vectors mean slower kernels due to lower occupancy. There is a nice table in the AMD APP SDK programming guide on the correlation of GPRs used to wavefronts/CU.


There are some cases where uint16 might in fact improve performance, like simple kernels that execute fast, where time is wasted on kernel scheduling and data transfers. In that case, uint16 means more work per kernel invocation, and the overall effect is positive when you weigh it against the increased GPR usage. Bitcoin kernels, though, are not such a case.
d3m0n1q_733rz
Sr. Member
****
Offline

Activity: 378
Merit: 250



View Profile WWW
February 06, 2012, 11:09:05 AM
Last edit: February 06, 2012, 11:28:50 AM by d3m0n1q_733rz
 #44

Quote from: gat3way on February 06, 2012, 10:38:18 AM
There is no "native 16-component vectors support" in any AMD GPU hardware, including GCN. [...]
Alright, but when it comes to 8 vectors, you can't argue with results.  I've posted the table of gains with the VLIW5 hardware I use.  And please read the papers on GCN again (assuming you read them once), as it's clearly stated that: "Not to be confused with the SIMD on Cayman (which is a collection of SPs), the SIMD on GCN is a true 16-wide vector SIMD. A single instruction and up to 16 data elements are fed to a vector SIMD to be processed over a single clock cycle. As with Cayman, AMD’s wavefronts are 64 instructions meaning it takes 4 cycles to actually complete a single instruction for an entire wavefront.  This vector unit is combined with a 64KB register file and that composes a single SIMD in GCN."
Now, as I was saying: the SIMDs are 16-wide and there are 4 of them, so each SIMD could be loaded with 16 vectors, which would let the calculations run on all of them without wasting any clock cycles.  Four sets of 16 vectors at once sounds pretty good to me.
Cayman takes 4 clock cycles because SPs are used.  GCN handles them in one.  You do the math.
Now, I don't know why Dia's been getting lower hash results with 8 vectors despite having more ALUs to handle them.  But I have an HD5450, and I get the highest hashing rate using 8 vectors and a worksize of 64; you can find my results on the previous page.  Oddly enough, that's on VLIW5, which isn't 8-wide.  8-wide would use half the ALUs of a single SIMD on GCN.  So what gives?

BTW, I'm talking about the 79xx series.

gat3way
Sr. Member
****
Offline

Activity: 256
Merit: 250


View Profile
February 06, 2012, 09:48:18 PM
 #45

I'm telling you again: you've got that wrong. The vector ALU units on GCN are not meant to map 1:1 to OpenCL's vectors. The GCN architecture is scalar in nature. The purpose of the vector ALU units is to handle ALU operations performed per work-item rather than per workgroup. Vector ALU operations take 4 cycles to execute, compared to 1 cycle on the scalar unit. There might be an advantage to vectorization in some cases, but that's not because the vector unit behaves as a 16-wide SIMD unit (which is wrong, btw). The vector unit "appears" to operate as a SIMD one, but that comes at the price of the instruction latency.

There is now a section on GCN architecture on the official APP SDK documentation:

http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

Along with everything else, it is clearly stated there:

Quote
Notes –
•   Vectorization is no longer needed, nor desirable. If your code used to
    combine work-items in order to get better VLIW use, this is no longer
    required.


Anyway, this is easy to demonstrate. Here is a very simple OpenCL kernel that shifts a kernel argument and writes it into an output buffer. This is the uint16 version:


Code:
__kernel void test(uint16 in, __global uint16 *dest)
{
    dest[get_global_id(0)] = in >> 2;
}

Here is the ISA dump:

Code:
 s_buffer_load_dword  s0, s[8:11], 0x04                    // 00000000: C2000904
  s_buffer_load_dword  s1, s[8:11], 0x18                    // 00000004: C2008918
  s_buffer_load_dwordx4  s[8:11], s[12:15], 0x00            // 00000008: C2840D00
  s_buffer_load_dwordx4  s[20:23], s[12:15], 0x04           // 0000000C: C28A0D04
  s_buffer_load_dwordx4  s[24:27], s[12:15], 0x08           // 00000010: C28C0D08
  s_buffer_load_dwordx4  s[28:31], s[12:15], 0x0c           // 00000014: C28E0D0C
  s_buffer_load_dword  s2, s[12:15], 0x10                   // 00000018: C2010D10
  s_waitcnt     lgkmcnt(0)                                  // 0000001C: BF8C007F
  s_mul_i32     s0, s16, s0                                 // 00000020: 93000010
  s_add_i32     s0, s0, s1                                  // 00000024: 81000100
  v_add_i32     v0, vcc, s0, v0                             // 00000028: 4A000000
  s_lshr_b32    s0, s8, 2                                   // 0000002C: 90008208
  s_lshr_b32    s1, s9, 2                                   // 00000030: 90018209
  s_lshr_b32    s3, s10, 2                                  // 00000034: 9003820A
  s_lshr_b32    s8, s11, 2                                  // 00000038: 9008820B
  v_lshlrev_b32  v0, 6, v0                                  // 0000003C: 34000086
  v_add_i32     v0, vcc, s2, v0                             // 00000040: 4A000002
  v_mov_b32     v1, s0                                      // 00000044: 7E020200
  v_mov_b32     v2, s1                                      // 00000048: 7E040201
  v_mov_b32     v3, s3                                      // 0000004C: 7E060203
  v_mov_b32     v4, s8                                      // 00000050: 7E080208
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 00000054: EBF71000 80010100
  s_lshr_b32    s0, s28, 2                                  // 0000005C: 9000821C
  s_lshr_b32    s1, s29, 2                                  // 00000060: 9001821D
  s_lshr_b32    s2, s30, 2                                  // 00000064: 9002821E
  s_lshr_b32    s3, s31, 2                                  // 00000068: 9003821F
  s_waitcnt     expcnt(0)                                   // 0000006C: BF8C1F0F
  v_mov_b32     v1, s0                                      // 00000070: 7E020200
  v_mov_b32     v2, s1                                      // 00000074: 7E040201
  v_mov_b32     v3, s2                                      // 00000078: 7E060202
  v_mov_b32     v4, s3                                      // 0000007C: 7E080203
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen offset:48 format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 00000080: EBF71030 80010100
  s_lshr_b32    s0, s24, 2                                  // 00000088: 90008218
  s_lshr_b32    s1, s25, 2                                  // 0000008C: 90018219
  s_lshr_b32    s2, s26, 2                                  // 00000090: 9002821A
  s_lshr_b32    s3, s27, 2                                  // 00000094: 9003821B
  s_waitcnt     expcnt(0)                                   // 00000098: BF8C1F0F
  v_mov_b32     v1, s0                                      // 0000009C: 7E020200
  v_mov_b32     v2, s1                                      // 000000A0: 7E040201
  v_mov_b32     v3, s2                                      // 000000A4: 7E060202
  v_mov_b32     v4, s3                                      // 000000A8: 7E080203
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen offset:32 format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 000000AC: EBF71020 80010100
  s_lshr_b32    s0, s20, 2                                  // 000000B4: 90008214
  s_lshr_b32    s1, s21, 2                                  // 000000B8: 90018215
  s_lshr_b32    s2, s22, 2                                  // 000000BC: 90028216
  s_lshr_b32    s3, s23, 2                                  // 000000C0: 90038217
  s_waitcnt     expcnt(0)                                   // 000000C4: BF8C1F0F
  v_mov_b32     v1, s0                                      // 000000C8: 7E020200
  v_mov_b32     v2, s1                                      // 000000CC: 7E040201
  v_mov_b32     v3, s2                                      // 000000D0: 7E060202
  v_mov_b32     v4, s3                                      // 000000D4: 7E080203
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen offset:16 format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 000000D8: EBF71010 80010100
  s_endpgm                                                  // 000000E0: BF810000
end

; ----------------- CS Data ------------------------

codeLenInByte        = 228;Bytes

userElementCount     = 3;
;  userElements[0]    = IMM_UAV, 10, s[4:7]
;  userElements[1]    = IMM_CONST_BUFFER, 0, s[8:11]
;  userElements[2]    = IMM_CONST_BUFFER, 1, s[12:15]
extUserElementCount  = 0;
NumVgprs             = 5;
NumSgprs             = 32;



And here is the uint4 version:


Code:
__kernel void test(uint4 in, __global uint4 *dest)
{
    dest[get_global_id(0)] = in >> 2;
}


Code:
 s_buffer_load_dword  s0, s[8:11], 0x04                    // 00000000: C2000904
  s_buffer_load_dword  s1, s[8:11], 0x18                    // 00000004: C2008918
  s_buffer_load_dwordx4  s[8:11], s[12:15], 0x00            // 00000008: C2840D00
  s_buffer_load_dword  s2, s[12:15], 0x04                   // 0000000C: C2010D04
  s_waitcnt     lgkmcnt(0)                                  // 00000010: BF8C007F
  s_mul_i32     s0, s16, s0                                 // 00000014: 93000010
  s_add_i32     s0, s0, s1                                  // 00000018: 81000100
  v_add_i32     v0, vcc, s0, v0                             // 0000001C: 4A000000
  s_lshr_b32    s0, s8, 2                                   // 00000020: 90008208
  s_lshr_b32    s1, s9, 2                                   // 00000024: 90018209
  s_lshr_b32    s3, s10, 2                                  // 00000028: 9003820A
  s_lshr_b32    s8, s11, 2                                  // 0000002C: 9008820B
  v_lshlrev_b32  v0, 4, v0                                  // 00000030: 34000084
  v_add_i32     v0, vcc, s2, v0                             // 00000034: 4A000002
  v_mov_b32     v1, s0                                      // 00000038: 7E020200
  v_mov_b32     v2, s1                                      // 0000003C: 7E040201
  v_mov_b32     v3, s3                                      // 00000040: 7E060203
  v_mov_b32     v4, s8                                      // 00000044: 7E080208
  tbuffer_store_format_xyzw  v[1:4], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32_32_32,BUF_NUM_FORMAT_FLOAT] // 00000048: EBF71000 80010100
  s_endpgm                                                  // 00000050: BF810000
end

; ----------------- CS Data ------------------------

codeLenInByte        = 84;Bytes

userElementCount     = 3;
;  userElements[0]    = IMM_UAV, 10, s[4:7]
;  userElements[1]    = IMM_CONST_BUFFER, 0, s[8:11]
;  userElements[2]    = IMM_CONST_BUFFER, 1, s[12:15]
extUserElementCount  = 0;
NumVgprs             = 5;
NumSgprs             = 18;



As you can see, the IL->ISA backend did not even bother to map the vector operations onto the vector unit; it used the scalar unit exclusively. The first version does 16 scalar shifts and wastes 32 SGPRs; the second does 4 scalar shifts and wastes 18 SGPRs.

Now, before you ask "why is it behaving like that": there are several reasons for this. Once again, stop thinking of OpenCL as something that should map 1:1 to hardware. OpenCL is a high-level API. Even with IL, you don't have that control; you cannot directly influence how the backend is going to map onto the hardware.

As for your improved results, I'd advise you to have a look at the Python host code and/or the share rate reported by the pool you're using. It's likely the progress indicator is not reporting the correct speed for some reason (wrong NDRange calculation, wrong divisor, or something like that). I've run those experiments in the past with my own and others' bitcoin kernels, and in all cases the kernel performance dropped abruptly with vectorization above 4 (due to reduced occupancy).
d3m0n1q_733rz
Sr. Member
****
Offline

Activity: 378
Merit: 250



View Profile WWW
February 06, 2012, 11:10:38 PM
 #46

Quote from: gat3way on February 06, 2012, 09:48:18 PM
I'm telling you again, you've gotten that wrong. The vector ALU unit on GCNs is not meant to map 1:1 with opencl's vectors. [...]
Which GPU are you using, specifically?  It sounds like you're describing Cayman.  And my results are accurate; it's more likely that your results dropped due to register spilling or high worksize settings.

gat3way
Sr. Member
****
Offline

Activity: 256
Merit: 250


View Profile
February 06, 2012, 11:19:03 PM
 #47

You can't tell Cayman ISA apart from Tahiti ISA? Smiley

Well, actually this was cross-compiled using the cl_amd_offline_devices extension, an AMD extension that lets you compile binary kernels for all hardware supported by the driver. The system I got the dumps and built the kernels on has a 6870. It doesn't matter, though, as the generated binary is the same one you would get from clBuildProgram() on a 79xx.
d3m0n1q_733rz
Sr. Member
****
Offline

Activity: 378
Merit: 250



View Profile WWW
February 06, 2012, 11:45:46 PM
 #48

Quote from: gat3way on February 06, 2012, 11:19:03 PM
You don't recognize Cayman ISA from Tahiti ISA? Smiley [...]
Actually, I was talking about Tahiti vs. Cayman, as Tahiti uses GCN and Cayman uses VLIW.  The Tahiti GPU differs from the rest of the 7xxx series because it's based on the GCN architecture, which contains four full 16-wide vector units.  From what you've told me, the best settings for the 79xx cards would be 2 vectors and extremely high worksizes, but that data is based on the lower-end, VLIW-based 7xxx cards in the series.  Since it takes 8 cycles to complete a group on VLIW, vectorization seems like a good option there.
The document you posted talks about the Southern Islands cards in general, but not about the specifics of the 16-wide vectors.
My argument is that VLIW can use more instructions, but GCN can use more vectors.  The idea I'm trying to convey is to keep the vectors high and the instructions applied to them low.  But I can't seem to avoid the darn register spilling.  >_<

gat3way
February 07, 2012, 12:04:24 AM  #49

Nope. There is no reason extremely high worksizes (what counts as extreme, by the way, 256?) would work best. My bet is that 64 would work best. I might be wrong about this since I lack data, and I lack data because I am too lazy to profile the current bitcoin kernels :)

The dumps I posted are from the Tahiti architecture; you can see there are no VLIW bundles and no clauses there, and the GCN ISA is clearly different from the VLIW one :) Also, there is no reason why 2-component vectors would work best on Tahiti. Why do you think uint2 would work best? I don't. It might work well, it might not. Have you run it through SKA or sprofile?

Once again (I am getting tired of this, so I am not going to restate it anymore): there is no 1:1 mapping between OpenCL vectors and the 79xx's vector ALUs. The 79xx cannot "use more vectors", and "using more vectors" does not mean "using fewer instructions" on GCN hardware. Frankly, I don't see why I am arguing about this. You are free to profile and benchmark. Again, do profile :)

d3m0n1q_733rz
February 07, 2012, 12:36:44 AM  #50

Quote from: gat3way on February 07, 2012, 12:04:24 AM

>_<  That's exactly what I'm trying to do.  I'm trying to let it use all 16 vector lanes at once instead of using smaller vectors to achieve the same thing.  That way the instructions aren't repeated and the overhead goes away.  But I've seen the best results with 8 vectors and a worksize of 64 or 32.

Diapolo (OP)
February 07, 2012, 03:49:26 PM  #51

In case you guys didn't see it, there is a new Phoenix 2 beta for which I released a DiaKGCN preview, see here:
https://bitcointalk.org/index.php?topic=62765.msg734465#msg734465

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
-ck
February 07, 2012, 10:51:36 PM  #52

So how are we going, Diapolo? Is your kernel ready for me to port to cgminer? ;)

Developer/maintainer for cgminer, ckpool/ckproxy, and the -ck kernel
2% Fee Solo mining at solo.ckpool.org
-ck
Diapolo (OP)
February 08, 2012, 06:53:01 AM  #53

Quote from: -ck on February 07, 2012, 10:51:36 PM

Hey Con,

It's ready for assimilation :D. The only thing is I really need your help with this. There are some differences in the supplied kernel variables and compiler arguments, which we should take a look at. Another difference is the output buffer, which is currently not compatible with the CGMINER code (but could be changed rather easily). I added another method of nonce calculation via the OpenCL 1.1 global offset, so a flag or function to detect OpenCL 1.1 would be needed in the CGMINER API.

At the end of the week I should have a bit more time than I have now, but the planning phase can start as soon as you give me the go (and take me by the hand ^^). What would you suggest as a first step?

Dia

-ck
February 08, 2012, 07:14:44 AM  #54

Check my git tree. There's a diakgcn branch now. Only the output side is different, but I can't get shares out of it yet. No support for new macros yet either.

-ck
February 08, 2012, 07:46:56 AM  #55
 #55

Quote from: -ck on February 08, 2012, 07:14:44 AM
Investigate the imported diakgcn. The only significant change is to the output code, but I get no shares yet...
https://github.com/ckolivas/cgminer/blob/diakgcn/diakgcn120208.cl

Diapolo (OP)
February 08, 2012, 10:15:06 AM  #56

Quote from: -ck on February 08, 2012, 07:46:56 AM

Cool, I'll take a look at it ...

Dia

d3m0n1q_733rz
February 08, 2012, 02:02:05 PM  #57

Hey, look into removing the base and uu ints with an #ifdef GOFFSET guard, since uu and base aren't used when GOFFSET is defined.  They're just two more instructions and register entries.

Diapolo (OP)
February 08, 2012, 02:06:02 PM  #58

Quote from: d3m0n1q_733rz on February 08, 2012, 02:02:05 PM

I had that in mind, because if GOFFSET is active, base is simply unused and can be removed from the code :). Thanks for sharing your idea!

Dia

d3m0n1q_733rz
February 08, 2012, 02:56:31 PM  #59

Quote from: Diapolo on February 08, 2012, 02:06:02 PM
Just remember to use #ifndef GOFFSET around the constant that sets base to uu.
I've also added GOFFSET to Phatk2!  *Happy dance*

It's tested and works.

Diapolo (OP)
February 08, 2012, 03:17:28 PM  #60

Quote from: d3m0n1q_733rz on February 08, 2012, 02:56:31 PM

Nice work :), did your version come out faster on your machine?

Dia
