Bitcoin Forum
December 05, 2016, 06:44:16 PM
News: Latest stable version of Bitcoin Core: 0.13.1  [Torrent].
 
Pages: « 1 2 3 4 5 6 7 8 9 10 11 12 13 [14] 15 16 »  All
  Print  
Author Topic: Modified Kernel for Phoenix 1.5  (Read 92139 times)
huayra.agera
Full Member
***
Offline Offline

Activity: 154



View Profile
August 12, 2011, 08:19:20 PM
 #261

Hi, I used phatk 2.2 on my 5 rigs and had restarting/BSOD errors occurring on all machines (5850 multi/single, 6850) on several occasions already.

Yes, there was an increase in hashrate; however, it seemed to have a memory leak or something. Just thought I'd inform you of this. Anyway, great work still. Looking forward to further improvements on the project. But for now, I'll revert to my previous settings.

BTC: 1JMPScxohom4MXy9X1Vgj8AGwcHjT8XTuy
jedi95
Full Member
***
Offline Offline

Activity: 219


View Profile
August 13, 2011, 06:38:02 AM
 #262


I will release a version that will work with cgminer early next week (looks like he has already implemented diapolo's old version).


Looking forward to this !!

Just sent one coin your way, and there's another once the work is done.

Quote
We are hitting a ceiling with opencl in general (and perhaps with the current hardware).  In one of the mining threads, vector76 and I were discussing the theoretical limit on hashing speeds... and unless there is a way to make the Maj() operation take 1 instruction, we are within about a percent of the theoretical limit on minimum number of instructions in the kernel unless we are missing something.

Out of curiosity, have you looked into trying to code a version directly in AMD's assembly language and bypassing OpenCL entirely? (I'm thinking: since we're already patching the ELF output, this seems like the logical next step Smiley)

Also, have you looked at AMD CAL? I know this is what ufasoft's miner uses (https://bitcointalk.org/index.php?topic=3486.500), and it's also what zorinaq considers the most efficient way to access AMD hardware (somewhere on http://blog.zorinaq.com).



Replacing one instruction in the ELF with another that uses the exact same inputs/outputs is one thing, but manually editing the ASM code is another thing entirely. Besides, with the work that has been done the GPU is already at >99% of the theoretical maximum throughput. (ALU packing) And as said above, we are also close to the theoretical minimum number of instructions to correctly run SHA256.

Also, if you look near the end of the hdminer thread you will notice that users are able to get the same hashrates from phatk on 69xx. For 58xx and other VLIW5 cards phatk is significantly faster than hdminer. If that's the best he can do with CAL then I don't see any reason to use it. hdminer had a substantial performance advantage back in March/April, but with basically every miner supporting BFI_INT this is no longer the case.

Phoenix Miner developer

Donations appreciated at:
1PHoenix9j9J3M6v3VQYWeXrHPPjf7y3rU
kano
Legendary
*
Offline Offline

Activity: 1918


Linux since 1997 RedHat 4


View Profile
August 14, 2011, 03:03:42 AM
 #263

Well, I've been talking to a few people about this but got no real response from anyone that it was possible ...
(Woke up with this idea back on the 4th of August ...)

So I guess I need to post in a thread where someone works on a CL kernel and just let them implement it if they don't already do it Tongue

I've written it in pseudo-code because I still don't follow how the CL file actually does 2^n checks and returns the full list of valid results.
Yeah, I've programmed in almost every language known to man (except C#, and that's avoided by choice) but I still don't quite get the interface from C/C++ to the CL and how that matches what happens.

What I am discussing is the 2nd call to SHA256 with the output of the first call (not the first call).

Anyway, to explain, here's the end of the SHA256 pseudo code from the wikipedia:
==================
  for i from 0 to 63
    s0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
    maj := (a and b) xor (a and c) xor (b and c)
    t2 := s0 + maj
    s1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
    ch := (e and f) xor ((not e) and g)
    t1 := h + s1 + ch + k[i] + w[i]

    h := g
    g := f
    f := e
    e := d + t1
    d := c
    c := b
    b := a
    a := t1 + t2

  Add this chunk's hash to result:
  h0 := h0 + a
  h1 := h1 + b
  h2 := h2 + c
  h3 := h3 + d
  h4 := h4 + e
  h5 := h5 + f
  h6 := h6 + g
  h7 := h7 + h

Then test if h0..h7 is a share (CHECK0, CHECK1, ?)
==================
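For anyone who wants to sanity-check the pseudo-code above, here is a rough single-block Python translation (an editorial sketch, not any miner's kernel; the function and variable names are made up for illustration). It matches hashlib on short inputs:

```python
import hashlib

# Standard SHA-256 round constants and initial hash values (FIPS 180-4)
K = [
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
]
H0 = [0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
      0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19]

def rotr(x, n):  # 32-bit right rotate
    return ((x >> n) | (x << (32 - n))) & 0xffffffff

def sha256_one_block(msg):
    """SHA-256 of a message short enough to fit one 512-bit block (<= 55 bytes)."""
    block = msg + b"\x80" + b"\x00" * (55 - len(msg)) + (8 * len(msg)).to_bytes(8, "big")
    w = [int.from_bytes(block[4 * i:4 * i + 4], "big") for i in range(16)]
    for i in range(16, 64):  # message schedule expansion
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & 0xffffffff)
    a, b, c, d, e, f, g, h = H0
    for i in range(64):  # the "for i from 0 to 63" loop above
        s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + s1 + ch + K[i] + w[i]) & 0xffffffff
        s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (s0 + maj) & 0xffffffff
        h, g, f, e = g, f, e, (d + t1) & 0xffffffff
        d, c, b, a = c, b, a, (t1 + t2) & 0xffffffff
    # "Add this chunk's hash to result"
    return "".join(f"{(x + y) & 0xffffffff:08x}" for x, y in zip(H0, [a, b, c, d, e, f, g, h]))

assert sha256_one_block(b"abc") == hashlib.sha256(b"abc").hexdigest()
```

A miner runs this compression in a double-SHA256 over the 80-byte block header, but the round structure being discussed is exactly the 64-iteration loop here.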

Firstly, I added that last line of course.
I understand that with current difficulty, if h0 != 0 then we don't have a share (call this CHECK0).
If h0 = 0, then check some leading part of h1 based on the current difficulty (call this CHECK1)
... feel free to correct this, anyone who knows better Smiley

If difficulty actually gets to checking h2, then my optimisation can be made even better by going back one more step (adding an i := 61) in the pseudo-code shown below

A reasonably simple optimisation of the end code for when we are about to check if h0..h7 is a share (i.e. only the 2nd hash):

==================
 for i from 0 to 61
    s0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
    maj := (a and b) xor (a and c) xor (b and c)
    t2 := s0 + maj
    s1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
    ch := (e and f) xor ((not e) and g)
    t1 := h + s1 + ch + k[i] + w[i]

    h := g
    g := f
    f := e
    e := d + t1
    d := c
    c := b
    b := a
    a := t1 + t2

 i := 62
    s0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
    maj := (a and b) xor (a and c) xor (b and c)
    t2 := s0 + maj
    s1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
    ch := (e and f) xor ((not e) and g)
    t1 := h + s1 + ch + k[i] + w[i]

 tmpa := t1 + t2
 tmpb := h1 + tmpa (this is the actual value of h1 at the end)
 if CHECK1 on tmpb then abort - not a share
  (i.e. return false for a share)

    h := g
    g := f
    f := e
    e := d + t1
    d := c
    c := b
    b := a
    a := tmpa

 i := 63
    s0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
    maj := (a and b) xor (a and c) xor (b and c)
    t2 := s0 + maj
    s1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
    ch := (e and f) xor ((not e) and g)
    t1 := h + s1 + ch + k[i] + w[i]

 tmpa := h0 + t1 + t2 (this is the actual value of h0 at the end)
 if CHECK0 on tmpa then abort - not a share
  (i.e. return false for a share)

    h := g
    g := f
    f := e
    e := d + t1
    d := c
    c := b

 Add this chunk's hash to result:
 h0 := tmpa
 h1 := tmpb
 h2 := h2 + c
 h3 := h3 + d
 h4 := h4 + e
 h5 := h5 + f
 h6 := h6 + g
 h7 := h7 + h

It's a share - unless we need to test h2?
==================

Firstly the obvious (as I've said twice above):
This should only be done when calculating a hash to be tested as a share.
Since the actual process is a double-hash, the first hash should not, of course, do this.

In i=62:
If the tmpb test (CHECK1) says it isn't a share, it avoids an entire loop (i=63), the 'e' calculation at i=62 and any unneeded assignments after that
and also we don't care about the actual values of h0-h7 so there is no need to assign them anything (or do the additions) except whatever is needed to affirm the result is not a share (e.g. set h0=-1 if h0..h7 must be examined later - or just return false if that is good enough - I don't know which the code actually needs)

CHECK1's probability of failure is high, so it easily covers the cost of the extra calculation (h1 + tmpa) needed to do it.

In i=63:
If the tmpa test (CHECK0) says it isn't a share, it avoids the 'e' calculation at i=63 and any unneeded assignments after that
and also we don't care about the actual values of h0-h7 so there is no need to assign them anything (or do the additions) except whatever is needed to affirm the result is not a share (e.g. set h0=-1 if h0..h7 must be examined later - or just return false if that is good enough - I don't know which the code actually needs)


P.S. any and all mistakes I've made - oh well, the concept is there anyway


Any mistakes? Comments?

Pool: https://kano.is BTC: 1KanoiBupPiZfkwqB7rfLXAzPnoTshAVmb
CKPool and CGMiner developer, IRC FreeNode #ckpool and #cgminer kanoi
Help keep Bitcoin secure by mining on pools with Stratum, the best protocol to mine Bitcoins with ASIC hardware
fpgaminer
Hero Member
*****
Offline Offline

Activity: 546



View Profile WWW
August 14, 2011, 10:11:42 AM
 #264

I've compiled a Win32 EXE for my poclbm fork (which has phatk, phatk2, phatk2.1, and phatk2.2 support):

http://www.bitcoin-mining.com/poclbm-progranism-win32-20110814a.zip
md5sum - df623a45f8cb0a50fcded92728f12c14

Let me know if it works, I was only able to test it on one machine so far.

Quote
Well, I've been talking to a few people about this but got no real response from anyone that it was possible ...
The optimization you've spelled out is more or less already implemented in most, if not all GPU miners.

The way GPU miners currently work is that they check in the GPU code whether h7==0. If it does, the result (a nonce) is returned, otherwise nothing is returned. It is the responsibility of the CPU software to do any further difficulty checks if needed.

Since the only thing the GPU miners care about is H7, they completely skip the last 3 rounds (stopping after the 61st round).

Also note that GPU miners don't calculate the first 3 rounds of the first pass. Those rounds are pre-computed, because the inputs to those rounds remain constant for a given unit of getwork. So a GPU miner really only computes a grand total of 122 rounds, minus various other small pre-calculations here and there.
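fpgaminer's "stop after the 61st round" trick falls out of the register shifts alone: whatever sits in e once 61 rounds have completed is exactly what lands in h (and hence in H7) after rounds 62-64 merely shift it along. A toy Python sketch of that data flow (editorial illustration; the t1/t2 values are random stand-ins, since the property depends only on the shift pattern, not on the real round function):

```python
import random

M = 1 << 32
rng = random.Random(0)
a, b, c, d, e, f, g, h = [rng.getrandbits(32) for _ in range(8)]

e_after_round_61 = None
for i in range(64):
    t1 = rng.getrandbits(32)  # stand-in for h + s1 + ch + k[i] + w[i]
    t2 = rng.getrandbits(32)  # stand-in for s0 + maj
    h, g, f, e = g, f, e, (d + t1) % M
    d, c, b, a = c, b, a, (t1 + t2) % M
    if i == 60:               # 61 rounds done (i = 0..60)
        e_after_round_61 = e

# After all 64 rounds, h is the e we saw three rounds earlier,
# so H7 = h7_init + h can be tested after only 61 rounds.
assert h == e_after_round_61

# The round count fpgaminer describes: two 64-round passes, minus 3 rounds
# pre-computed at the start of pass 1 and 3 skipped at the end of pass 2.
assert 2 * 64 - 3 - 3 == 122
```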

Clipse
Hero Member
*****
Offline Offline

Activity: 504


View Profile
August 14, 2011, 10:57:41 AM
 #265

You may be one, but you are the champion of many Tongue

It's working great on my lazy spare windows machine, thanks Smiley

...In the land of the stale, the man with one share is king... >> Clipse

We pay miners at 130% PPS | Signup here : Bonus PPS Pool (Please read OP to understand the current process)
kano
Legendary
*
Offline Offline

Activity: 1918


Linux since 1997 RedHat 4


View Profile
August 14, 2011, 11:47:00 AM
 #266

...
Quote
Well, I've been talking to a few people about this but got no real response from anyone that it was possible ...
The optimization you've spelled out is more or less already implemented in most, if not all GPU miners.

The way GPU miners currently work is that they check in the GPU code whether h7==0. If it does, the result (a nonce) is returned, otherwise nothing is returned. It is the responsibility of the CPU software to do any further difficulty checks if needed.

Since the only thing the GPU miners care about is H7, they completely skip the last 3 rounds (stopping after the 61st round).

Also note that GPU miners don't calculate the first 3 rounds of the first pass. Those rounds are pre-computed, because the inputs to those rounds remain constant for a given unit of getwork. So a GPU miner really only computes a grand total of 122 rounds, minus various other small pre-calculations here and there.
OK, so I've got the H's back-to-front (H7 is the first one, not H0) - then yeah, that makes sense of doing even fewer steps than what I said.
Still, why not do the share/H6 test on the GPU - it would certainly be faster - shares are also rare compared to a job (about 1 in 2 billion)?
Is that an issue with the CL not being able to be changed based on the difficulty?
Yet it could be done as a simple pre-calculated number to AND against the H6 value (extra calculation) when H7 is zero.
(I should work out what difficulty value is high enough to need to test H5 ... though that may be so large it would never be reached)

Edit: of course if a nonce (H7=0) is the requirement of a share - then there is no more testing (no testing of H6) required.
I need to read pushpool more closely to determine exactly what a share is ... unless someone feels like answering that ... Smiley

Edit2: so skipping the first 3 rounds of the first pass is possible (128 - 3 = 125),
but there are actually 3.5 rounds not needed at the end of the 2nd pass - though I guess you already do that.
Round 60 of the 2nd pass becomes only the calculations necessary to get t1 (s1 & ch), since s0 and maj (and of course t2) are unneeded.

Pool: https://kano.is BTC: 1KanoiBupPiZfkwqB7rfLXAzPnoTshAVmb
CKPool and CGMiner developer, IRC FreeNode #ckpool and #cgminer kanoi
Help keep Bitcoin secure by mining on pools with Stratum, the best protocol to mine Bitcoins with ASIC hardware
Diapolo
Hero Member
*****
Offline Offline

Activity: 769



View Profile WWW
August 14, 2011, 04:05:07 PM
 #267

It seems like your latest kernel and mine have problems if BFI_INT gets forced off via BFI_INT=false ... it seems the results are invalid every time.
Any idea Phateus?

Perhaps #define Ch(x, y, z) bitselect(x, y, z) is not right?

Edit: Could be my setup if no one else has this error Cheesy.

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
techwtf
Full Member
***
Offline Offline

Activity: 140


View Profile
August 14, 2011, 04:46:33 PM
 #268

One of my cards (5850 @ 835 MHz; down-clocking to 810 MHz still failed) seems unable to work well with phatk 2.1/2.2. It dies after a while (<1h), requiring a miner restart (win32) / reset (linux).
Diapolo's 2011.7.17 kernel is OK @ 835 MHz.
BOARBEAR
Member
**
Offline Offline

Activity: 77


View Profile
August 14, 2011, 05:58:50 PM
 #269

I tried to figure out the reason version 2.2 does not work well with VECTORS4, but I could not, as I do not have enough knowledge.
Here are some results I found:

replacing this block of code in version 2.1 with the corresponding block in version 2.2 will make VECTORS4 much slower


Code:
#define P1(n) ((rot(W[(n)-2],15u)^rot(W[(n)-2],13u)^((W[(n)-2])>>10U)))
#define P2(n) ((rot(W[(n)-15],25u)^rot(W[(n)-15],14u)^((W[(n)-15])>>3U)))
#define P3(x)  W[x-7]
#define P4(x)  W[x-16]


//Partial Calcs for constant W values
#define P1C(n) ((rotate(ConstW[(n)-2],15)^rotate(ConstW[(n)-2],13)^((ConstW[(n)-2])>>10U)))
#define P2C(n) ((rotate(ConstW[(n)-15],25)^rotate(ConstW[(n)-15],14)^((ConstW[(n)-15])>>3U)))
#define P3C(x)  ConstW[x-7]
#define P4C(x)  ConstW[x-16]

//SHA round with built in W calc
#define sharoundW(n)  Vals[(3 + 128 - (n)) % 8] += t1W(n); Vals[(7 + 128 - (n)) % 8] = t1W(n) + t2(n);  

//SHA round without W calc
#define sharound(n) Vals[(3 + 128 - (n)) % 8] += t1(n); Vals[(7 + 128 - (n)) % 8] = t1(n) + t2(n);

//SHA round for constant W values
#define sharoundC(n) Barrier(n); Vals[(3 + 128 - (n)) % 8] += t1C(n); Vals[(7 + 128 - (n)) % 8] = t1C(n) + t2(n);

//The compiler is stupid... I put this in there only to stop the compiler from (de)optimizing the order
#define Barrier(n) t1 = t1C((n) % 64)

And this block is not the only thing that causes the problem.

I am guessing it has something to do with the rotC function (it is only a guess).
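A side note on the quoted macros, in case the rotation amounts look odd: phatk's rot() is a left-rotate, so P1/P2 are just the standard SHA-256 small-sigma functions written with left rotations (a left rotate by n equals a right rotate by 32-n on 32-bit words). A quick Python check of that equivalence (editorial sketch):

```python
def rotl(x, n):  # 32-bit left rotate
    return ((x << n) | (x >> (32 - n))) & 0xffffffff

def rotr(x, n):  # 32-bit right rotate
    return ((x >> n) | (x << (32 - n))) & 0xffffffff

for x in (1, 0x12345678, 0xdeadbeef, 0xffffffff):
    # P1 = rot(w,15) ^ rot(w,13) ^ (w >> 10) is sigma1: ROTR17 ^ ROTR19 ^ SHR10
    assert rotl(x, 15) ^ rotl(x, 13) ^ (x >> 10) == rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)
    # P2 = rot(w,25) ^ rot(w,14) ^ (w >> 3) is sigma0: ROTR7 ^ ROTR18 ^ SHR3
    assert rotl(x, 25) ^ rotl(x, 14) ^ (x >> 3) == rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)
```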
fpgaminer
Hero Member
*****
Offline Offline

Activity: 546



View Profile WWW
August 14, 2011, 06:30:05 PM
 #270

Quote
Still, why not do the share/H6 test in GPU - it would certainly be faster - shares are also rare compared to a job (about 1 in 2 billion)
Is that an issue with the CL not being able to be changed based on the difficulty?
There are several reasons.

99.99% of the time the mining software only needs to look for Difficulty 1 (a share, H7==0), so there is rarely a need to check for anything else.
GPUs absolutely hate branching; a full Difficulty check involves many branches.
Smaller GPU programs are better GPU programs.
The CPU runs in parallel to the GPU. Since the CPU is fully capable of checking for extra Difficulty levels, why would you burden the GPU with such work?
The CPU should double-check the GPU's results anyway, to detect errors. Since the CPU will thus be recomputing the full two SHA-256 passes for each result returned by the GPU, it again makes sense to only check for higher difficulties on the CPU.

Diapolo
Hero Member
*****
Offline Offline

Activity: 769



View Profile WWW
August 14, 2011, 07:47:31 PM
 #271

It seems like your latest kernel and mine have problems if BFI_INT gets forced off via BFI_INT=false ... it seems the results are invalid every time.
Any idea Phateus?

Perhaps #define Ch(x, y, z) bitselect(x, y, z) is not right?

Edit and solved, non BFI_INT Ch has to be:
Code:
#define Ch(x, y, z) bitselect(z, y, x)
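The reversal makes sense given how OpenCL defines bitselect(a, b, c): each result bit comes from b where the corresponding bit of c is set, and from a where it is clear, i.e. the selector comes last, whereas Ch(x, y, z) selects on its first argument. A small Python model of the two orderings (editorial sketch of the semantics, not kernel code):

```python
MASK = 0xffffffff

def bitselect(a, b, c):
    # OpenCL bitselect: take bits of b where c is 1, bits of a where c is 0
    return ((a & ~c) | (b & c)) & MASK

def ch(x, y, z):
    # SHA-256 Ch: for each bit, if x then y else z
    return ((x & y) ^ (~x & z)) & MASK

for x, y, z in [(0x0f0f0f0f, 0x12345678, 0x9abcdef0),
                (0, 0, MASK), (MASK, 0xdeadbeef, 0x1234abcd)]:
    # Diapolo's fix: the selector x must go in the last slot
    assert bitselect(z, y, x) == ch(x, y, z)

# The naive bitselect(x, y, z) ordering is not equivalent in general:
assert bitselect(0, 0, MASK) != ch(0, 0, MASK)
```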

If you want to thank someone, you can donate to 1LY4hGSY6rRuL7BQ8cjUhP2JFHFrPp5JVe (Vince -> who did a GREAT job during my kernel development)!

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
RoadStress
Legendary
*
Offline Offline

Activity: 1470


View Profile
August 15, 2011, 01:24:41 PM
 #272

Sent another donation your way.  Look forward to your work on cgminer.
+1

iCEBREAKER is a troll! He and cypherdoc helped HashFast scam 50 Million $ from its customers !
H/w Hosting Directory & Reputation - https://bitcointalk.org/index.php?topic=622998.0
Phateus
Jr. Member
*
Offline Offline

Activity: 52


View Profile
August 15, 2011, 05:54:59 PM
 #273

It seems like your latest kernel and mine have problems if BFI_INT gets forced off via BFI_INT=false ... it seems the results are invalid every time.
Any idea Phateus?

Perhaps #define Ch(x, y, z) bitselect(x, y, z) is not right?

Edit and solved, non BFI_INT Ch has to be:
Code:
#define Ch(x, y, z) bitselect(z, y, x)

If you want to thank someone, you can donate to 1LY4hGSY6rRuL7BQ8cjUhP2JFHFrPp5JVe (Vince -> who did a GREAT job during my kernel development)!

Dia

Awesome, thank you!  I was under the assumption that BFI_INT and bitselect were the same operation; apparently the operand order is different.  I will fix it in my next release.

Thank you everyone for your support (both in BTC and discussion).

I should have a drop-in version of the kernel available for cgminer soon, so anyone wanting to try out the pre-release, I'll be posting it tonight.

@BOARBEAR
*sigh*.... come on man... do you even read my posts? There is no single cause of the bad performance.  2.2 executes fewer instructions and uses fewer registers than 2.1, but as I said... there is some weird issue which makes OpenCL slower behind the scenes.  My best guess is that it has to do with register allocation.

The GPU has a total of 256x32x4 registers (8192 UINT4).  At the most, there are 256 threads per workgroup (8192/256 = 32 registers per thread).  Using VECTORS, the number of registers is far below this number, so the hardware can run the maximum allowable threads at a time.  However, when you compile with VECTORS4, there are more than 32 registers per thread.  OpenCL must determine how to allocate the threads, and the utilization of the video card is sub-optimal.  Below is a diagram of what I think is going on.


4 thread groups running simultaneously VECTORS (2 running at a time)
[1111111122222222]
[3333333344444444]

using an optimal version of VECTORS4, it would look much like this (double the work is done per thread)
[1111111111111111]
[2222222222222222]
[3333333333333333]
[4444444444444444]

now making it use slightly fewer resources will make it slower, because the threads are out of sync and there will be overhead in syncing and tracking data within threadgroups:
[1111111111111112]
[2222222222222233]
[3333333333333444]
[4444444444445555]

Now, I may be waaaaay off here, but something like this is what makes sense to me.  Especially since this would explain why decreasing the memory actually improves performance in some cases (by forcing synchronization).
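The register budget in that explanation reduces to simple arithmetic (numbers taken from the post above; the occupancy model below is an editorial simplification — real hardware allocates in wavefront granularity, so treat it as a sketch):

```python
TOTAL_UINT4_REGS = 256 * 32          # 8192 UINT4 registers, per the post
MAX_THREADS = 256                    # max threads per workgroup

# The budget that keeps every thread slot usable:
assert TOTAL_UINT4_REGS // MAX_THREADS == 32   # registers per thread

def concurrent_threads(regs_per_thread):
    """Threads the register file can hold at once, capped at the hardware max."""
    return min(TOTAL_UINT4_REGS // regs_per_thread, MAX_THREADS)

# VECTORS: well under 32 registers per thread -> full occupancy
assert concurrent_threads(24) == 256
# VECTORS4: over the 32-register budget -> fewer threads in flight
assert concurrent_threads(40) == 204
```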

Anyway, enough of my off-topic analysis...


I will release a version that will work with cgminer early next week (looks like he has already implemented diapolo's old version).


Looking forward to this !!

Just sent one coin your way, and there's another once the work is done.

Quote
We are hitting a ceiling with opencl in general (and perhaps with the current hardware).  In one of the mining threads, vector76 and I were discussing the theoretical limit on hashing speeds... and unless there is a way to make the Maj() operation take 1 instruction, we are within about a percent of the theoretical limit on minimum number of instructions in the kernel unless we are missing something.

Out of curiosity, have you looked into trying to code a version directly in AMD's assembly language and bypassing OpenCL entirely? (I'm thinking: since we're already patching the ELF output, this seems like the logical next step Smiley)

Also, have you looked at AMD CAL? I know this is what ufasoft's miner uses (https://bitcointalk.org/index.php?topic=3486.500), and it's also what zorinaq considers the most efficient way to access AMD hardware (somewhere on http://blog.zorinaq.com).



Replacing one instruction in the ELF with another that uses the exact same inputs/outputs is one thing, but manually editing the ASM code is another thing entirely. Besides, with the work that has been done the GPU is already at >99% of the theoretical maximum throughput. (ALU packing) And as said above, we are also close to the theoretical minimum number of instructions to correctly run SHA256.

Also, if you look near the end of the hdminer thread you will notice that users are able to get the same hashrates from phatk on 69xx. For 58xx and other VLIW5 cards phatk is significantly faster than hdminer. If that's the best he can do with CAL then I don't see any reason to use it. hdminer had a substantial performance advantage back in March/April, but with basically every miner supporting BFI_INT this is no longer the case.

Agreed, the kernel itself is pretty optimal.  I might look into calling lower-level CAL functions to manage the (OpenCL-compiled) GPU threads instead of using OpenCL, but I doubt this will give any speedup (although I might be able to reduce the CPU overhead).

http://deepbit.net/userbar/4dcec4d1816197e144000002_bfe143123a.png

Feeling Generous?
124RraPqYcEpX5qFcQ2ZBVD9MqUamfyQnv
Phateus
Jr. Member
*
Offline Offline

Activity: 52


View Profile
August 16, 2011, 02:35:42 AM
 #274

Alright... I'm getting a little delayed on the prerelease for cgminer... mingw is a pain in the ass.. trying a full cygwin install next...

Bear with me, hopefully I'll get it running tomorrow.

-Phateus

http://deepbit.net/userbar/4dcec4d1816197e144000002_bfe143123a.png

Feeling Generous?
124RraPqYcEpX5qFcQ2ZBVD9MqUamfyQnv
-ck
Moderator
Legendary
*
Offline Offline

Activity: 1988


Ruu \o/


View Profile WWW
August 16, 2011, 12:07:38 PM
 #275

Alright... I'm getting a little delayed on the prerelease for cgminer... mingw is a pain in the ass.. trying a full cygwin install next...

Bear with me, hopefully I'll get it running tomorrow.

-Phateus
You could just tell me what to do to interface it with cgminer (i.e. what new variables you want) and I'd copy most of your kernel across. Only the return code and define macros are actually different in cgminer's version of the kernel.

Primary developer/maintainer for cgminer and ckpool/ckproxy.
Pooled mine at kano.is, solo mine at solo.ckpool.org
-ck
BOARBEAR
Member
**
Offline Offline

Activity: 77


View Profile
August 16, 2011, 03:20:46 PM
 #276

It seems like your latest kernel and mine have problems if BFI_INT gets forced off via BFI_INT=false ... it seems the results are invalid every time.
Any idea Phateus?

Perhaps #define Ch(x, y, z) bitselect(x, y, z) is not right?

Edit and solved, non BFI_INT Ch has to be:
Code:
#define Ch(x, y, z) bitselect(z, y, x)

If you want to thank someone, you can donate to 1LY4hGSY6rRuL7BQ8cjUhP2JFHFrPp5JVe (Vince -> who did a GREAT job during my kernel development)!

Dia

Awesome, thank you!  I was under the assumption that BFI_INT and bitselect were the same operation; apparently the operand order is different.  I will fix it in my next release.

Thank you everyone for your support (both in BTC and discussion).

I should have a drop-in version of the kernel available for cgminer soon, so anyone wanting to try out the pre-release, I'll be posting it tonight.

@BOARBEAR
*sigh*.... come on man... do you even read my posts? There is no single cause of the bad performance.  2.2 executes fewer instructions and uses fewer registers than 2.1, but as I said... there is some weird issue which makes OpenCL slower behind the scenes.  My best guess is that it has to do with register allocation.

The GPU has a total of 256x32x4 registers (8192 UINT4).  At the most, there are 256 threads per workgroup (8192/256 = 32 registers per thread).  Using VECTORS, the number of registers is far below this number, so the hardware can run the maximum allowable threads at a time.  However, when you compile with VECTORS4, there are more than 32 registers per thread.  OpenCL must determine how to allocate the threads, and the utilization of the video card is sub-optimal.  Below is a diagram of what I think is going on.


4 thread groups running simultaneously VECTORS (2 running at a time)
[1111111122222222]
[3333333344444444]

using an optimal version of VECTORS4, it would look much like this (double the work is done per thread)
[1111111111111111]
[2222222222222222]
[3333333333333333]
[4444444444444444]

now making it use slightly fewer resources will make it slower, because the threads are out of sync and there will be overhead in syncing and tracking data within threadgroups:
[1111111111111112]
[2222222222222233]
[3333333333333444]
[4444444444445555]

Now, I may be waaaaay off here, but something like this is what makes sense to me.  Especially since this would explain why decreasing the memory actually improves performance in some cases (by forcing synchronization).

Anyway, enough of my off-topic analysis...


I will release a version that will work with cgminer early next week (looks like he has already implemented diapolo's old version).


Looking forward to this !!

Just sent one coin your way, and there's another once the work is done.

Quote
We are hitting a ceiling with opencl in general (and perhaps with the current hardware).  In one of the mining threads, vector76 and I were discussing the theoretical limit on hashing speeds... and unless there is a way to make the Maj() operation take 1 instruction, we are within about a percent of the theoretical limit on minimum number of instructions in the kernel unless we are missing something.

Out of curiosity, have you looked into trying to code a version directly in AMD's assembly language and bypassing OpenCL entirely? (I'm thinking: since we're already patching the ELF output, this seems like the logical next step Smiley)

Also, have you looked at AMD CAL? I know this is what ufasoft's miner uses (https://bitcointalk.org/index.php?topic=3486.500), and it's also what zorinaq considers the most efficient way to access AMD hardware (somewhere on http://blog.zorinaq.com).



Replacing one instruction in the ELF with another that uses the exact same inputs/outputs is one thing, but manually editing the ASM code is another thing entirely. Besides, with the work that has been done the GPU is already at >99% of the theoretical maximum throughput. (ALU packing) And as said above, we are also close to the theoretical minimum number of instructions to correctly run SHA256.

Also, if you look near the end of the hdminer thread you will notice that users are able to get the same hashrates from phatk on 69xx. For 58xx and other VLIW5 cards phatk is significantly faster than hdminer. If that's the best he can do with CAL then I don't see any reason to use it. hdminer had a substantial performance advantage back in March/April, but with basically every miner supporting BFI_INT this is no longer the case.

Agreed, the kernel itself is pretty optimal.  I might look into calling lower-level CAL functions to manage the (OpenCL-compiled) GPU threads instead of using OpenCL, but I doubt this will give any speedup (although I might be able to reduce the CPU overhead).
I understand what you are saying.  Perhaps version 2.1 will be the last version that works well with VECTORS4.  You said the work that has been done on the GPU is already at >99% of the theoretical maximum throughput, but VECTORS4 alone gives me about a 1.5% boost (a contradiction?).  That is why I tried hard to find a way to make VECTORS4 work, so that future versions can use it.
Phateus
Jr. Member
*
Offline Offline

Activity: 52


View Profile
August 16, 2011, 05:49:36 PM
 #277

Alright... I'm getting a little delayed on the prerelease for cgminer... mingw is a pain in the ass..


Yeah, mingw is most certainly a giant PITA.

To compile cgminer with mingw, the trick is to use msys and get pkg-config and libcurl installed properly

For pkg-config, the best is to install this: http://ftp.gnome.org/pub/gnome/binaries/win32/gtk+/2.22/gtk+-bundle_2.22.1-20101227_win32.zip

Once you have that, libcurl is rather easy.

Quote
trying a full cygwin install next...

Mmmh. Not sure this'll get you very far.

If your main dev box is windows and your goal is to integrate
phatk into cgminer, your best bet is probably to install a small
virtual machine (qemu or vmplayer) running ubuntu inside your
windows box and work on cgminer directly on Linux in there.

That's exactly what I do (the other way round) when I have to
try windows-specific things or a piece of code.


Yeah, I think I will stay away from using the mingw environment from now on... Cygwin was easy as pie.  No issues; I think I can cross-compile from Cygwin using mingw if I want native Win32 support.  Apparently, getting pkg-config working without POSIX support is terrible.  I got my kernel working around 5am last night, linking against the cygwin dlls... so tonight I will release the changes when I get home.

Alright... I'm getting a little delayed on the prerelease for cgminer... mingw is a pain in the ass.. trying a full cygwin install next...

Bear with me, hopefully I'll get it running tomorrow.

-Phateus
You could just tell me what to do to interface it with cgminer (i.e. what new variables you want) and I'd copy most of your kernel across. Only the return code and define macros are actually different in cgminer in the kernel itself.

Yeah, if you want, I can send you the changes tonight so you can put it in your release.  The only modifications I had to make to the kernel are changing VECTORS to VECTORS2, hardcoding OUTPUT_SIZE = 4095 and hardcoding WORKSIZE=256 (I really do need this passed to the kernel though).  Also, my kernel only uses WORKSIZE+1 entries in the buffer, so it would be better if you made the buffer that size.

As for the changes in the miner, I think I only had to change the precalc_hash() function, the kernel input and output file names, and the queue_phatk_kernel() function.
What I will do tonight is add KL_PHATK_2_2 to the cl_kernel enum, copy the function code, add the corresponding command line argument (right now I have just replaced PHATK with mine) and add -DWORKSIZE= arguments for the kernel.

Anyway, I will give you more details tonight when I am in front of my code.
My fork is https://github.com/Phateus/cgminer; I will upload the changes tonight (as soon as I figure out git... never used it before).

-Phateus

P.S. thanks for the easy to read code Smiley

http://deepbit.net/userbar/4dcec4d1816197e144000002_bfe143123a.png

Feeling Generous?
124RraPqYcEpX5qFcQ2ZBVD9MqUamfyQnv
-ck
Moderator
Legendary
*
Offline Offline

Activity: 1988


Ruu \o/


View Profile WWW
August 16, 2011, 10:18:23 PM
 #278

Seems to me like you've got it all under control, so I'll leave you to finish up. Thanks for your involvement. However I don't want multiple phatk kernels so just replace the current one in-situ and don't bother enumming a different kernel. As for the output code, I prefer to use 4k so feel free to do it your way, but be aware I plan to change it back.

Primary developer/maintainer for cgminer and ckpool/ckproxy.
Pooled mine at kano.is, solo mine at solo.ckpool.org
-ck
Phateus
Jr. Member
*
Offline Offline

Activity: 52


View Profile
August 17, 2011, 04:13:55 AM
 #279

Seems to me like you've got it all under control, so I'll leave you to finish up. Thanks for your involvement. However I don't want multiple phatk kernels so just replace the current one in-situ and don't bother enumming a different kernel. As for the output code, I prefer to use 4k so feel free to do it your way, but be aware I plan to change it back.

Ok, the source is up... I am trying to figure out how to compile this for windows without the cygwin layer (I really haven't done any of this before... I am soooo lost)...

https://github.com/Phateus/cgminer

ckolivas... if you want to merge this into your code at some point, let me know what I have to do... I literally installed git yesterday, and there is only so much you can learn on the internet in a day ;-)

As for the buffer, my kernel only uses WORKSIZE+1 parts of your buffer, but I left the buffer size intact.

http://deepbit.net/userbar/4dcec4d1816197e144000002_bfe143123a.png

Feeling Generous?
124RraPqYcEpX5qFcQ2ZBVD9MqUamfyQnv
-ck
Moderator
Legendary
*
Offline Offline

Activity: 1988


Ruu \o/


View Profile WWW
August 17, 2011, 05:14:20 AM
 #280

Seems to me like you've got it all under control, so I'll leave you to finish up. Thanks for your involvement. However I don't want multiple phatk kernels so just replace the current one in-situ and don't bother enumming a different kernel. As for the output code, I prefer to use 4k so feel free to do it your way, but be aware I plan to change it back.

Ok, the source is up... I am trying to figure out how to compile this for windows without the cygwin layer (I really haven't done any of this before... I am soooo lost)...

https://github.com/Phateus/cgminer

ckolivas... if you want to merge this into your code at some point, let me know what I have to do... I literally installed git yesterday, and there is only so much you can learn on the internet in a day ;-)

As for the buffer, my kernel only uses WORKSIZE+1 parts of your buffer, but I left the buffer size intact.
Very good work. Nice of you to figure out how to do git and all as well. Don't worry about the merge; I've taken care of everything and cherry-picked your changes as needed. I've modified a few things too, to be consistent with cgminer's code, and there is definitely a significant speed advantage thanks to your changes. Note that if you're ever working in git doing your own changes, do them on a branch that's not called master, as you may otherwise make it impossible to pull back my changes, since I won't necessarily take all your code. Thanks again, and I'm sure the cgminer users will be most grateful. Smiley

Primary developer/maintainer for cgminer and ckpool/ckproxy.
Pooled mine at kano.is, solo mine at solo.ckpool.org
-ck