Bitcoin Forum
Author Topic: further improved phatk_dia kernel for Phoenix + SDK 2.6 - 2012-01-13  (Read 101450 times)
Diapolo
August 06, 2011, 11:02:36 AM
 #261

Quote
It says otherwise in the first post :P

You are right, sorry for that! Just VECTORS2 is the way to go. I edited the first post.

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
joulesbeef
August 06, 2011, 08:31:26 PM
 #262

Quote
Phoenix 1.5 has the bfipatcher.py included, so I never included it in any kernel package
Strange, it must not have been in my GUIMiner's version of Phoenix 1.5, which does seem different.


Quote
You are right, sorry for that! Just VECTORS2 is the way to go. I edited the first post


Just to be clear though: VECTORS VECTORS2 doesn't hurt anything, it is just extraneous?


I preferred to leave it as it was: less typing and deleting when testing the versions.

mooo for rent
Diapolo
August 06, 2011, 09:31:21 PM
 #263

Quote
Phoenix 1.5 has the bfipatcher.py included, so I never included it in any kernel package
Strange, it must not have been in my GUIMiner's version of Phoenix 1.5, which does seem different.


Quote
You are right, sorry for that! Just VECTORS2 is the way to go. I edited the first post


Just to be clear though: VECTORS VECTORS2 doesn't hurt anything, it is just extraneous?


I preferred to leave it as it was: less typing and deleting when testing the versions.


Yeah, it doesn't hurt, it's just ignored.

Dia

bcforum
August 07, 2011, 01:39:19 AM
 #264

New version was just released, it should be the fastest for 69XX cards:
Download version 2011-08-04 (pre-release): http://www.mediafire.com/?upwwud7kfyx7788

This is the preferred switch for Phoenix in order to achieve comparable performance:
Code:
-k phatk AGGRESSION=12 BFI_INT FASTLOOP=false VECTORS VECTORS2 WORKSIZE=256
or
Code:
-k phatk AGGRESSION=12 BFI_INT FASTLOOP=false VECTORS VECTORS2 WORKSIZE=128

Please test this version with SDK 2.4 / SDK 2.5! SDK 2.1 performance seems worse, but at least it should work. Report any errors and problems here and let me know what you think.
Have a look at your cards' temperatures; I got a report that they may be lower, which would be great :).

Regards,
Dia

I get 0.8 MH/s faster with phoenix-r112, but temps do appear to be 3C-4C lower.

6970 Lightning (940,1375) x2
Ubuntu 10.10
SDK 2.4
Cat 11.3


If you found this post useful, feel free to share the wealth: 1E35gTBmJzPNJ3v72DX4wu4YtvHTWqNRbM
dishwara
August 07, 2011, 03:52:36 AM
 #265

Just to be clear though: VECTORS VECTORS2 doesn't hurt anything, it is just extraneous?
I preferred to leave it as it was: less typing and deleting when testing the versions.
But if there are two vector switches, as in "VECTORS VECTORS2", which one is taken into account: the first or the last on the command line?
I ask because VECTORS2 and VECTORS give different performance.
jedi95
August 07, 2011, 06:37:46 AM
 #266

The latest version (2011-08-04) has a major problem that I can see.

The assumption that there won't be more than 1 valid nonce per kernel execution is very wrong. At aggression 14, for example, each kernel execution tests 2^30 nonces. The chance that there will be more than 1 valid nonce in any given kernel execution in this case is going to be about 2.5% (if I did the math right). This effectively causes a net loss in performance compared to the previous version at high aggression. At lower aggression values (10 and below) this is less of a problem, since the performance loss in these cases will be much less than 1%.
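
A rough check of that 2.5% figure, as a sketch only. It assumes each nonce independently passes the kernel's H == 0 test with probability 2^-32 (a difficulty-1 share) and takes the 2^30 nonces per execution at aggression 14 from the paragraph above. The function name, the Poisson approximation and the 2^26 comparison size are illustrative choices, not anything taken from Phoenix.

```python
import math

def p_more_than_one(nonces_per_run, p_hit=2.0 ** -32):
    """Poisson estimate of P(two or more valid nonces in one kernel run)."""
    lam = nonces_per_run * p_hit            # expected valid nonces per run
    return 1.0 - math.exp(-lam) * (1.0 + lam)

# 2^30 nonces per run is the aggression-14 figure quoted above.
print(f"{p_more_than_one(2 ** 30):.2%}")
# Assumed 16x smaller run, only to show how quickly the chance falls off.
print(f"{p_more_than_one(2 ** 26):.4%}")
```

Under these assumptions the first value comes out at about 2.65%, in line with the estimate above, and the smaller run is far below 1%.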

Phoenix Miner developer

Donations appreciated at:
1PHoenix9j9J3M6v3VQYWeXrHPPjf7y3rU
Diapolo
August 07, 2011, 07:21:39 AM
 #267

The latest version (2011-08-04) has a major problem that I can see.

The assumption that there won't be more than 1 valid nonce per kernel execution is very wrong. At aggression 14, for example, each kernel execution tests 2^30 nonces. The chance that there will be more than 1 valid nonce in any given kernel execution in this case is going to be about 2.5% (if I did the math right). This effectively causes a net loss in performance compared to the previous version at high aggression. At lower aggression values (10 and below) this is less of a problem, since the performance loss in these cases will be much less than 1%.

You have to compare the loss of valid nonces to the higher efficiency because of the removed control flow in the kernel (all current GPUs dislike if/else and so on). I thought this tradeoff would be well worth it, but you could prove me wrong. I was thinking about a better way of writing the positive nonces into output, but that didn't work.

Any good ideas for that part of the kernel will be a big plus!

Dia


Diapolo
August 07, 2011, 03:13:05 PM
 #268

Updated 1st post kernel performance data with SDK 2.5 and KernelAnalyzer 1.9 Cal 11.7 profile.

Beta-coiner1
August 07, 2011, 05:55:18 PM
 #269

New version was just released, it should be the fastest for 69XX cards:
Download version 2011-08-04 (pre-release): http://www.mediafire.com/?upwwud7kfyx7788

This is the preferred switch for Phoenix in order to achieve comparable performance:
Code:
-k phatk AGGRESSION=12 BFI_INT FASTLOOP=false VECTORS VECTORS2 WORKSIZE=256
or
Code:
-k phatk AGGRESSION=12 BFI_INT FASTLOOP=false VECTORS VECTORS2 WORKSIZE=128

Please test this version with SDK 2.4 / SDK 2.5! SDK 2.1 performance seems worse, but at least it should work. Report any errors and problems here and let me know what you think.
Have a look at your cards' temperatures; I got a report that they may be lower, which would be great :).

Regards,
Dia

I get 0.8 MH/s faster with phoenix-r112, but temps do appear to be 3C-4C lower.

6970 Lightning (940,1375) x2
Ubuntu 10.10
SDK 2.4
Cat 11.3


I can confirm the temps difference, which I thought was strange. Using Catalyst 11.6B/SDK 2.5 on a 6950 @867/1250 with V 4 W64 F3, temps are 3 C lower using GUI miner. Hash rate has also increased by 3 MH/s with those settings, and invalids are definitely much lower vs. Phateus.

If you believe this post has helped,please donate-

1JfwAiV8WbT9HtmQVvnkJ6KwUy9gLjzxdK
Diapolo
August 07, 2011, 07:15:22 PM
 #270

New version was just released, it should be the fastest for 69XX cards:
Download version 2011-08-04 (pre-release): http://www.mediafire.com/?upwwud7kfyx7788

This is the preferred switch for Phoenix in order to achieve comparable performance:
Code:
-k phatk AGGRESSION=12 BFI_INT FASTLOOP=false VECTORS VECTORS2 WORKSIZE=256
or
Code:
-k phatk AGGRESSION=12 BFI_INT FASTLOOP=false VECTORS VECTORS2 WORKSIZE=128

Please test this version with SDK 2.4 / SDK 2.5! SDK 2.1 performance seems worse, but at least it should work. Report any errors and problems here and let me know what you think.
Have a look at your cards' temperatures; I got a report that they may be lower, which would be great :).

Regards,
Dia

I get 0.8 MH/s faster with phoenix-r112, but temps do appear to be 3C-4C lower.

6970 Lightning (940,1375) x2
Ubuntu 10.10
SDK 2.4
Cat 11.3


I can confirm the temps difference, which I thought was strange. Using Catalyst 11.6B/SDK 2.5 on a 6950 @867/1250 with V 4 W64 F3, temps are 3 C lower using GUI miner. Hash rate has also increased by 3 MH/s with those settings, and invalids are definitely much lower vs. Phateus.

I have to ask to understand you ... you say that my current pre-release version generates 3°C less heat for your card and invalid share rate is lower in comparison to the latest Phateus phatk?

Dia

Diapolo
August 07, 2011, 07:27:31 PM
 #271

To all happy new kernel users, there is one thing you should know ... there have been NO donations since 2011-07-31, which makes me a bit sad.

It's my free time that I put in here (many hours so far), and the motivation is not only to get a "Thank you!". Remember, you guys generate more BTC with the kernel mods. It doesn't matter if it's my mod, Phateus' mod or anyone else's mod ... just be a little thankful and you keep a free and fast kernel + a motivated kernel mixer Diapolo ;).

No offense to all the great people who already donated a few bitcents or even more, who helped me testing this, who helped me fix bugs or who added great ideas into this work!

Regards,
Diapolo

Beta-coiner1
August 07, 2011, 08:07:03 PM
 #272

I can confirm the temps difference, which I thought was strange. Using Catalyst 11.6B/SDK 2.5 on a 6950 @867/1250 with V 4 W64 F3, temps are 3 C lower using GUI miner. Hash rate has also increased by 3 MH/s with those settings, and invalids are definitely much lower vs. Phateus.

I have to ask to understand you ... you say that my current pre-release version generates 3°C less heat for your card and invalid share rate is lower in comparison to the latest Phateus phatk?

Dia

Yes, that would be correct. I also sent a bitcent your way to help out, even though it might not be much. Here's hoping for more development for the 69xx architecture. ;)

drlatino999
August 07, 2011, 10:59:38 PM
 #273

Using the recommended settings -
Code:
-k phatk AGGRESSION=12 BFI_INT FASTLOOP=false VECTORS VECTORS2 WORKSIZE=128

My 6950 dropped 3C, 5830 stayed the same.

Sappers clear the way
joulesbeef
August 07, 2011, 11:49:40 PM
 #274

Quote
WORKSIZE=128p
Typo, or something new I don't know about?

mooo for rent
drlatino999
August 08, 2011, 01:00:03 AM
 #275

Typo, let me edit that to reflect.

Sappers clear the way
Diapolo
August 08, 2011, 04:31:25 AM
 #276

Quote
WORKSIZE=128p
Typo, or something new I don't know about?


It's only a typo there ...

RedLine888
August 08, 2011, 09:19:31 AM
 #277

Hi! Dunno whether the info I provide would be of any use but nevertheless...

Installed the 2011-08-04 kernel version and got +~4 MH/s on the 6950 and -~3 MH/s on the 5870, and my 5870 became unstable!!!

It works at 990 core and 360 mem with the previous version of your kernel and is perfectly stable, but with this new version the driver crashes after a few seconds even at 980 core. The temps are perfect and stay below 78 C.

Thanx though for your work!

If I helped and You do wanna thank:
1FoiQYVPtUwWWnrYe1oYV3GCtJP8YBe1fv
Feel free to PM if you're in a need of any help
-------------------------------------------------------------------
ssateneth
August 08, 2011, 05:14:29 PM
 #278

I still don't know why people are doing "VECTORS VECTORS2". VECTORS is an invalid argument for diapolo phatk ever since 8-04. The only valid arguments are VECTORS2 and VECTORS4.
Quote
Important: since version 2011-08-04 (pre-release) you have to use the switch VECTORS2 instead of VECTORS. I made this change to be clear what vectors are used in the kernel (2- or 4-component). To use 4-component vectors use switch VECTORS4.

jedi95
August 08, 2011, 06:57:58 PM
 #279

You have to compare the loss of valid nonces to the higher efficiency because of the removed control flow in the kernel (all current GPUs dislike if/else and so on). I thought this tradeoff would be well worth it, but you could prove me wrong. I was thinking about a better way of writing the positive nonces into output, but that didn't work.

Any good ideas for that part of the kernel will be a big plus!

Dia

After looking at the code more carefully your method is only problematic if more than 1 vector component returns a valid nonce. The odds of this happening are EXTREMELY small, since you would have to find more than 1 valid hash in a range of only 2 or 4 hashes.
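
To put a number on "extremely small", here is a sketch under the assumption that each hash passes the H == 0 test independently with probability 2^-32. `p_multi_hit` is a hypothetical helper name, and the widths 2 and 4 stand for the VECTORS2 and VECTORS4 cases.

```python
from math import comb

def p_multi_hit(width, p_hit=2.0 ** -32):
    """Binomial P(at least two of `width` hashes are valid).

    The terms for k >= 2 are summed directly, because computing
    1 - P(0) - P(1) would lose the tiny result to floating-point rounding.
    """
    return sum(
        comb(width, k) * p_hit ** k * (1.0 - p_hit) ** (width - k)
        for k in range(2, width + 1)
    )

for width in (2, 4):
    print(width, p_multi_hit(width))
```

For width 2 this is simply 2^-64, and for width 4 it is about six times that, so both are negligible next to the per-execution figure discussed earlier in the thread.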

That said, I have devised a way to remove the if(nonce) control structure entirely. This makes a couple assumptions:

1. Control flow instructions have a large clock cycle penalty regardless of the branch taken (so you get 44 cycle penalty on Cypress and Cayman regardless of if H == 0)
2. Writing values to output[] for every nonce even if the nonce is invalid does not incur a significant clock cycle cost relative to the control flow instructions. (ideally <10 clocks, but if it's below ~30 the code below will still be faster than the current code)

The steps:

1. OR the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = output[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

Steps 7-8 are to produce an 8-bit index that is 0 for all invalid nonces and hopefully unique for each valid nonce, assuming there are a small number of valid nonces. Even in the worst case (more than 1 hash found in a single execution) at least 1 will be returned, and if 3 or fewer nonces are found per execution, all of them should be returned in most cases.

output[0] will be overwritten constantly by invalid nonces (since the 1-bit number from step 5 will be 0 unless the hash satisfies H == 0, the resulting 8-bit number will also be 0). output[>0] will contain valid nonces with a small chance of collisions.

Cypress and Cayman (58xx and 69xx respectively) have a 44 cycle latency for control flow instructions.

Steps 1 - 8 should execute in 1 clock each (however they can't be vectorized, so this won't exploit any ILP).

Step 9 takes no longer than the current code for valid nonces, but this will now also apply to invalid nonces.

Overall this should be fast, return only valid nonces, and retain the capability to return more than one nonce if the assumptions above are true.

An example of how even a single 1 in the input will cause the output of steps 1-5 to be 0:
--------------------------------------------------------------------------------------

H = 0000000000000001 0000000000000000

00000000 00000001
00000000 00000000
-------------------OR
00000000 00000001

0000 0000
0000 0001
----------OR
0000 0001

00 00
00 01
------OR
00 01

0 0
0 1
---OR
0 1

0
1
-NOR
0
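
Steps 1-8 can be sketched on the host side as follows. This is an illustration in Python, not kernel code: plain integers stand in for 32-bit uints, and the function names are made up for the example. One assumption is labeled in the code: for step 6 the 1-bit flag is widened to an all-ones mask before the AND, since a literal bitwise AND with the value 1 would keep only the lowest bit of the nonce.

```python
MASK32 = 0xFFFFFFFF

def is_zero_flag(h):
    """Steps 1-5: fold a 32-bit value down to one bit that is 1 only if h == 0."""
    h &= MASK32
    h = (h >> 16) | (h & 0xFFFF)      # step 1: 32 -> 16 bits
    h = (h >> 8) | (h & 0xFF)         # step 2: 16 -> 8 bits
    h = (h >> 4) | (h & 0xF)          # step 3: 8 -> 4 bits
    h = (h >> 2) | (h & 0x3)          # step 4: 4 -> 2 bits
    return ((h >> 1) | (h & 1)) ^ 1   # step 5: NOR of the two remaining bits

def output_index(h, nonce):
    """Steps 6-8: 8-bit slot index, 0 whenever h != 0."""
    flag = is_zero_flag(h)
    mask = (-flag) & MASK32           # ASSUMPTION: widen the flag to 0 or 0xFFFFFFFF
    kept = nonce & mask               # step 6: nonce if valid, else 0
    x = (kept >> 16) ^ (kept & 0xFFFF)   # step 7: XOR the two 16-bit halves
    return (x >> 8) | (x & 0xFF)      # step 8: OR the two 8-bit halves

print(is_zero_flag(0x00010000))       # the worked example above (H with a single 1 bit)
print(hex(output_index(0, 0x12345678)))
```

With the worked example H = 0x00010000 the flag is 0, so the index is 0 as well. Note that a valid nonce can also fold to index 0 (for instance when its two 16-bit halves are equal), in which case it would land in the slot that invalid nonces keep overwriting.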

Diapolo
August 08, 2011, 07:38:59 PM
 #280

You have to compare the loss of valid nonces to the higher efficiency because of the removed control flow in the kernel (all current GPUs dislike if/else and so on). I thought this tradeoff would be well worth it, but you could prove me wrong. I was thinking about a better way of writing the positive nonces into output, but that didn't work.

Any good ideas for that part of the kernel will be a big plus!

Dia

After looking at the code more carefully your method is only problematic if more than 1 vector component returns a valid nonce. The odds of this happening are EXTREMELY small, since you would have to find more than 1 valid hash in a range of only 2 or 4 hashes.

That said, I have devised a way to remove the if(nonce) control structure entirely. This makes a couple assumptions:

1. Control flow instructions have a large clock cycle penalty regardless of the branch taken (so you get 44 cycle penalty on Cypress and Cayman regardless of if H == 0)
2. Writing values to output[] for every nonce even if the nonce is invalid does not incur a significant clock cycle cost relative to the control flow instructions. (ideally <10 clocks, but if it's below ~30 the code below will still be faster than the current code)

The steps:

1. OR the low 16-bits of H against the high 16 bits
2. Take the resulting 16-bit number and OR the low 8 bits against the high 8-bits
3. Take the resulting 8-bit number and OR the low 4 bits against the high 4-bits
4. Take the resulting 4-bit number and OR the low 2 bits against the high 2-bits
5. Take the resulting 2-bit number and NOR the first bit against the second bit

6. do bitwise AND of the resulting 1-bit number against the nonce
7. take the result from #6 and XOR the low 16-bits against the high 16-bits
8. take the resulting 16-bit number from #7 and OR the low 8-bits against the high 8-bits
9. store the result by doing output[OUTPUT_SIZE] = output[result of #8] = nonce

Steps 1-5 create a single bit indicating if the nonce meets H == 0. When you bitwise AND this against the nonce in step 6 you will get 0 for any invalid nonces and for valid nonces you will just get the nonce again. (1 AND X = X)

Steps 7-8 are to produce an 8-bit index that is 0 for all invalid nonces and hopefully unique for each valid nonce, assuming there are a small number of valid nonces. Even in the worst case (more than 1 hash found in a single execution) at least 1 will be returned, and if 3 or fewer nonces are found per execution, all of them should be returned in most cases.

output[0] will be overwritten constantly by invalid nonces (since the 1-bit number from step 5 will be 0 unless the hash satisfies H == 0, the resulting 8-bit number will also be 0). output[>0] will contain valid nonces with a small chance of collisions.

Cypress and Cayman (58xx and 69xx respectively) have a 44 cycle latency for control flow instructions.

Steps 1 - 8 should execute in 1 clock each (however they can't be vectorized, so this won't exploit any ILP).

Step 9 takes no longer than the current code for valid nonces, but this will now also apply to invalid nonces.

Overall this should be fast, return only valid nonces, and retain the capability to return more than one nonce if the assumptions above are true.

An example of how even a single 1 in the input will cause the output of steps 1-5 to be 0:
--------------------------------------------------------------------------------------

H = 0000000000000001 0000000000000000

00000000 00000001
00000000 00000000
-------------------OR
00000000 00000001

0000 0000
0000 0001
----------OR
0000 0001

00 00
00 01
------OR
00 01

0 0
0 1
---OR
0 1

0
1
-NOR
0

Thanks Jedi, I will look into this tomorrow. The last thing I tried was this (with every entry of the output buffer then having to be checked):

Code:
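// Each component is the candidate nonce from W_3 when Vals[7] == -H[7], otherwise 0.
const uint2 nonce = (uint2){((Vals[7].x == -H[7]) * W_3.x), ((Vals[7].y == -H[7]) * W_3.y)};

// Unconditional writes: a non-matching component stores 0 into output[0].
output[OUTPUT_MASK & (nonce.x >> 2)] = nonce.x;
output[OUTPUT_MASK & (nonce.y >> 2)] = nonce.y;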
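
How those unconditional writes behave can be simulated on the host. This is a sketch only: `OUTPUT_MASK = 0xFF` is an assumed value (the real one is not given in this thread), `write_candidates` is a hypothetical name, and the booleans stand in for the `Vals[7] == -H[7]` comparisons.

```python
OUTPUT_MASK = 0xFF   # ASSUMPTION: the kernel's actual mask is not stated here

def write_candidates(output, matches, candidates):
    """Mimic the branch-free writes for one vector of candidate nonces."""
    for hit, cand in zip(matches, candidates):
        nonce = int(hit) * cand                     # candidate on a match, else 0
        output[OUTPUT_MASK & (nonce >> 2)] = nonce  # a miss writes 0 to slot 0
    return output

buf = write_candidates([0] * (OUTPUT_MASK + 1), (False, True), (0x1230, 0x1231))
found = [(i, hex(v)) for i, v in enumerate(buf) if v]
print(found)
```

Because the slot comes from the nonce itself, the host has to scan the whole buffer for non-zero entries, and two valid nonces whose masked indices coincide would overwrite one another.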

Dia
