Bitcoin Forum
Author Topic: Modified Kernel for Phoenix 1.5  (Read 92108 times)
ssateneth
Legendary
*
Offline Offline

Activity: 1288



View Profile
August 10, 2011, 07:52:14 PM
 #241

Why not make two separate kernels then?

VECTORS4 might one day be the better alternative, instead of doing all that work then why not start now and keep pace?



Because I have literally put in over 100 hours on the main kernel and have gotten almost nothing in donations.  I just don't have the time to keep up with two kernels.  If anyone feels like making a VECTORS4 branch, go for it... the source code is in the public domain and you can use it however you'd like.  ;)

Also, from what I've gathered, there may be only 1 or 2 people interested in it... If you can lower your memory speed, I think VECTORS will always be faster than VECTORS4.

Now, I do like hearing feedback from everyone. I am just letting you know that it is not feasible to optimize the kernel for every possible configuration (SDK 2.1, 2.4, slow memory).  Right now, the kernel is optimized for SDK 2.5 and the 68xx and 5xxx cards and assuming you pick the best memory clock speed for your card (somewhere around 1/3 of your core clock).

-Phateus
The thing is, VECTORS4 worked perfectly for me in version 2.1.
In version 2.2 it's broken.

As in it doesn't work at all, or that it is much slower?... Just use version 2.1 then

The behavior is as if it's not doing 4 nonces, but only doing 1 (i.e. no VECTORS option specified). My compute speed remained the same regardless of memory speed, which is exactly like your V1 result on the graph on page 1.

critical
Full Member
***
Offline Offline

Activity: 162


View Profile
August 11, 2011, 09:49:54 AM
 #242

in guiminer, i keep getting invalid buffer, unable to write to file, wonder why
Diapolo
Hero Member
*****
Offline Offline

Activity: 769



View Profile WWW
August 11, 2011, 11:09:00 AM
 #243

Just did a test:

Rig setup:
  Linuxcoin v0.2b (Linux version 2.6.38-2-amd64)
  Dual HD5970 (4 GPU cores in the rig)
  Mem clock @ 300 MHz
  Core clock @ 800 MHz
  VCore @ 1.125 V
  AMD SDK 2.5
  Phoenix r100
  Phatk v2.2
  -v -k phatk BFI_INT VECTORS WORKSIZE=256 AGGRESSION=11 FASTLOOP=false

Result:
  Overall Rig rate: 1484 MH/s
  Rate per core: 371 MH/s

This is ~4MH/s faster than Diapolo's latest.

On 5970, phatk 2.2 is current king of the hill.

For the world to be perfect, this kernel needs to be integrated into cgminer :)



The latest kernel releases show that it takes a bit of trial and error to find THE perfect kernel for a specific setup. Phateus and I use the KernelAnalyzer and our own setups as a first measurement of whether a new kernel got "faster". But many other factors come into play, such as OS, driver, SDK, miner software and so on.

I would suggest that we try to create kernels based on the same kernel parameters for phatk and phatk-Diapolo, so that users are free to choose which kernel is used. One difference is that the CGMINER kernel uses the switch VECTORS2, where Phoenix used only VECTORS (which I changed to VECTORS2 in my last kernel releases). The variable names inside the kernel don't even have to match (in fact they sometimes differ), as long as the main miner software passes the expected values to the kernel in a defined sequence.

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
MegaBux
Jr. Member
*
Offline Offline

Activity: 33


View Profile
August 11, 2011, 03:26:33 PM
 #244

As of version 2.1, phatk now has command line option "VECTORS4" which can be used instead of "VECTORS".
This option works on 4 nonces per thread instead of 2 and may increase speed mainly if you do not underclock your memory, but feel free to try it out.  Note that if you use this, you will more than likely have to decrease your WORKSIZE to 128 or 64.

I'm using a 6770 @ 1.01Ghz with phatk 2.2.  When I run the memory clock at 300Mhz with the VECTORS option, I get 234.5Mhps.  However, I can't seem to reap the benefits of VECTORS2 or VECTORS4 at a higher memory clock (i.e. 1.2Ghz).  I've reduced the WORKSIZE from 256 to 128 and 64 and peak around 213Mhps;  with these options, I can only achieve between 204 and 213 Mhps.
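To make "N nonces per thread" concrete, here is a rough sketch of how a nonce range could be split across work-items for vector widths 1, 2 and 4. The function names and the layout (each work-item taking `width` consecutive nonces) are assumptions for illustration only, not taken from the phatk source.

```python
def nonces_for_thread(base, thread_id, width):
    """Nonces handled by one work-item.

    Assumed layout (hypothetical): each work-item takes `width`
    consecutive nonces starting at base + thread_id * width.
    """
    start = base + thread_id * width
    return [start + lane for lane in range(width)]


def threads_needed(total_nonces, width):
    """Work-items required to cover `total_nonces` at a given vector width."""
    if total_nonces % width:
        raise ValueError("total_nonces must be a multiple of width")
    return total_nonces // width


if __name__ == "__main__":
    total = 1 << 20
    for name, width in (("no VECTORS", 1), ("VECTORS", 2), ("VECTORS4", 4)):
        print(name, threads_needed(total, width), "work-items,",
              "work-item 3 handles", nonces_for_thread(0, 3, width))
```

Wider vectors mean fewer work-items, each doing more work and holding more state, which fits the advice quoted above that VECTORS4 will likely need a smaller WORKSIZE.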
Phateus
Jr. Member
*
Offline Offline

Activity: 52


View Profile
August 11, 2011, 04:33:14 PM
 #245

Just did a test:

Rig setup:
  Linuxcoin v0.2b (Linux version 2.6.38-2-amd64)
  Dual HD5970 (4 GPU cores in the rig)
  Mem clock @ 300 MHz
  Core clock @ 800 MHz
  VCore @ 1.125 V
  AMD SDK 2.5
  Phoenix r100
  Phatk v2.2
  -v -k phatk BFI_INT VECTORS WORKSIZE=256 AGGRESSION=11 FASTLOOP=false

Result:
  Overall Rig rate: 1484 MH/s
  Rate per core: 371 MH/s

This is ~4MH/s faster than Diapolo's latest.

On 5970, phatk 2.2 is current king of the hill.

For the world to be perfect, this kernel needs to be integrated into cgminer :)



The latest kernel releases show that it takes a bit of trial and error to find THE perfect kernel for a specific setup. Phateus and I use the KernelAnalyzer and our own setups as a first measurement of whether a new kernel got "faster". But many other factors come into play, such as OS, driver, SDK, miner software and so on.

I would suggest that we try to create kernels based on the same kernel parameters for phatk and phatk-Diapolo, so that users are free to choose which kernel is used. One difference is that the CGMINER kernel uses the switch VECTORS2, where Phoenix used only VECTORS (which I changed to VECTORS2 in my last kernel releases). The variable names inside the kernel don't even have to match (in fact they sometimes differ), as long as the main miner software passes the expected values to the kernel in a defined sequence.

Dia

A good idea.

A further improvement: I'd like to have an option in my miner that spends ~2 minutes
benchmarking all the kernels available in the current directory (without talking to
a pool, i.e. doing pure SHA256 on bogus nonces), and picking the fastest for the
current rig.

For people with lots of different rigs/setups, that would save them the headache
of having to hand-tune each instance.
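A minimal sketch of such a benchmark-and-pick step, assuming each kernel can be wrapped as a Python callable `kernel(header76, start_nonce, count)` that returns the number of hashes it performed. That interface, and the CPU stand-in kernel, are hypothetical; a real miner would enqueue OpenCL work here instead.

```python
import hashlib
import struct
import time


def double_sha256(data):
    """Bitcoin-style double SHA-256."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def cpu_reference_kernel(header76, start_nonce, count):
    """Stand-in 'kernel': hashes `count` bogus nonces on the CPU."""
    for nonce in range(start_nonce, start_nonce + count):
        double_sha256(header76 + struct.pack("<I", nonce & 0xFFFFFFFF))
    return count


def benchmark(kernels, seconds_per_kernel=0.2, batch=2000):
    """Time each kernel on bogus work; return (best_name, rates in hashes/s)."""
    header76 = bytes(76)  # bogus 76-byte header prefix, no pool involved
    rates = {}
    for name, kernel in kernels.items():
        done = 0
        start = time.perf_counter()
        while time.perf_counter() - start < seconds_per_kernel:
            done += kernel(header76, done, batch)
        rates[name] = done / (time.perf_counter() - start)
    return max(rates, key=rates.get), rates


if __name__ == "__main__":
    best, rates = benchmark({"cpu_reference": cpu_reference_kernel})
    print("fastest:", best, rates)
```

The same loop could be repeated per WORKSIZE/AGGRESSION combination, at the cost of a longer benchmark run.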


What I am currently working on is a modified version of phoenix which runs multiple kernels with a single instance and a single work queue (to decrease excessive getwork).
I am also working on plugin support for it, so you can use various added features (such as built-in gui, Web interface, logger, autotune, variable aggression for when computer is idle, overclocking support, etc...)
This would make it tremendously easier for anyone to add features and you can still use whichever kernel works best for you.

As for cgminer support, I haven't tried it, are there any benefits over phoenix?  I may fork that instead of phoenix and make the plugin support via command-line, lua or javascript, although I find that python is much easier to code than c (especially for cross platform support).

http://deepbit.net/userbar/4dcec4d1816197e144000002_bfe143123a.png

Feeling Generous?
124RraPqYcEpX5qFcQ2ZBVD9MqUamfyQnv
Phateus
Jr. Member
*
Offline Offline

Activity: 52


View Profile
August 11, 2011, 04:50:32 PM
 #246

As of version 2.1, phatk now has command line option "VECTORS4" which can be used instead of "VECTORS".
This option works on 4 nonces per thread instead of 2 and may increase speed mainly if you do not underclock your memory, but feel free to try it out.  Note that if you use this, you will more than likely have to decrease your WORKSIZE to 128 or 64.

I'm using a 6770 @ 1.01Ghz with phatk 2.2.  When I run the memory clock at 300Mhz with the VECTORS option, I get 234.5Mhps.  However, I can't seem to reap the benefits of VECTORS2 or VECTORS4 at a higher memory clock (i.e. 1.2Ghz).  I've reduced the WORKSIZE from 256 to 128 and 64 and can only seem to peak at 213Mhps.  With these options, I can only achieve between 204 and 213 Mhps.

I have found that VECTORS4 is extremely unreliable... even tiny changes in the kernel and other factors affect the hashrate tremendously...  OpenCL gets really weird when you use a lot of registers.  I added it in 2.1 because it was comparable to VECTORS in some situations, but changing the kernel slightly in 2.2 seems to have broken it (even though KernelAnalyzer says it uses fewer registers and fewer ALU ops... *sigh*)

Anyone wondering about any new kernel improvements, I seem to be at a standstill... I have tried the following:
  • Removing all control flow operations (about 1MH/s slower)
  • Sending all kernel arguments in a buffer (about 1MH/s slower)
  • Using an atomic counter for the output so that the output buffer is written sequentially (about the same speed and only works on ATI xxx cards and newer)
  • Using an internal loop in the kernel to process multiple nonces (Either significantly slower or massive desktop lag)
  • Calling set_arg only once per getwork instead of once per kernel call (only faster when using very low aggression and FASTLOOP, I will add this to my next kernel release)

-Phateus

jedi95
Full Member
***
Offline Offline

Activity: 219


View Profile
August 11, 2011, 08:44:55 PM
 #247


What I am currently working on is a modified version of phoenix which runs multiple kernels with a single instance and a single work queue (to decrease excessive getwork).
I am also working on plugin support for it, so you can use various added features (such as built-in gui, Web interface, logger, autotune, variable aggression for when computer is idle, overclocking support, etc...)
This would make it tremendously easier for anyone to add features and you can still use whichever kernel works best for you.

As for cgminer support, I haven't tried it, are there any benefits over phoenix?  I may fork that instead of phoenix and make the plugin support via command-line, lua or javascript, although I find that python is much easier to code than c (especially for cross platform support).

In most cases you won't see much if any decrease in the number of getwork requests by running multiple kernels behind the same work queue. The reason for having a work queue in the first place is so that the miner only needs to ask for more work when the queue falls below a certain size. During normal operation Phoenix won't request more work than absolutely necessary. There might be a small benefit to doing this when the block changes, but aside from that the getwork count for a single instance running 2 kernels compared to 2 instances will be very close.
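A toy simulation of that argument, as a sketch only: the `WorkQueue` class, its thresholds and the `simulate` helper are made up for illustration and are not Phoenix's actual implementation. It assumes every work unit is fully consumed and ignores block changes and long polling.

```python
from collections import deque


class WorkQueue:
    """Toy work queue: asks the pool for work only when it runs low."""

    def __init__(self, low_water=1, refill_to=2):
        self.low_water = low_water
        self.refill_to = refill_to
        self.units = deque()
        self.getwork_count = 0

    def _refill(self):
        while len(self.units) < self.refill_to:
            self.getwork_count += 1            # one simulated getwork request
            self.units.append(self.getwork_count)

    def fetch(self):
        """Hand one work unit to a kernel, refilling first if needed."""
        if len(self.units) <= self.low_water:
            self._refill()
        return self.units.popleft()


def simulate(num_queues, kernels_per_queue, units_per_kernel):
    """Total getwork requests when every kernel consumes `units_per_kernel` units."""
    queues = [WorkQueue() for _ in range(num_queues)]
    for _ in range(units_per_kernel):
        for queue in queues:
            for _ in range(kernels_per_queue):
                queue.fetch()
    return sum(queue.getwork_count for queue in queues)


if __name__ == "__main__":
    print("1 instance, 2 kernels:", simulate(1, 2, 100))
    print("2 instances, 1 kernel each:", simulate(2, 1, 100))
```

Under these assumptions the request count is driven by how much work the kernels consume, so the two setups differ only by the extra unit each separate queue keeps buffered.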

That said, I am interested to see the results of the other changes you mentioned. Feel free to PM me if you have any questions.

Phoenix Miner developer

Donations appreciated at:
1PHoenix9j9J3M6v3VQYWeXrHPPjf7y3rU
deepceleron
Legendary
*
Offline Offline

Activity: 1470



View Profile WWW
August 12, 2011, 03:12:32 AM
 #248

Big Edit:

I looked again at the AMD APP SDK v2.5, trying to get it to not suck. I did one more thing, not only did I install the 2.5 SDK (on Catalyst 11.6), but I also re-compiled pyopencl 0.92 against the newer SDK. On phatk 2.2, changing just from 2.4 SDK to 2.5 SDK with a matching pyOpenCL gets a hair more mhash:
SDK 2.4: 309.97
SDK 2.5: 310.10

Just to let people know, regarding the APP SDK, the version installed as well as the version used to compile pyopencl both seem to matter (not that this helps you if you are using just the prepackaged Windows phoenix.exe.)

Using a pyOpenCL newer than 0.92 gives a deprecation warning:

[0 Khash/sec] [0 Accepted] [0 Rejected] [RPC]kernels\phatk\__init__.py:414: DeprecationWarning: 'enqueue_read_buffer' has been deprecated in version 2011.1. Please use enqueue_copy() instead.
  self.commandQueue, self.output_buf, self.output)
[11/08/2011 21:10:22] Server gave new work; passing to WorkQueue
[291.32 Mhash/sec] [0 Accepted] [0 Rejected] [RPC (+LP)]kernels\phatk\__init__.py:427: DeprecationWarning: 'enqueue_write_buffer' has been deprecated in version 2011.1. Please use enqueue_copy() instead.
  self.commandQueue, self.output_buf, self.output)
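For anyone wanting to silence the warning, a minimal sketch of a compatibility helper follows. The helper name is hypothetical; it relies only on the two pyopencl calls named in the warning, `enqueue_copy(queue, dest, src)` and the older `enqueue_read_buffer(queue, mem, hostbuf)`. The module is passed in as a parameter so the demo can run with a stand-in object instead of a real GPU.

```python
from types import SimpleNamespace


def read_device_buffer(cl, queue, host_array, device_buf):
    """Copy device_buf into host_array, preferring the non-deprecated call.

    `cl` is the imported pyopencl module (or a stand-in for testing).
    """
    if hasattr(cl, "enqueue_copy"):
        # pyopencl 2011.1 and newer: destination first, then source
        return cl.enqueue_copy(queue, host_array, device_buf)
    # older pyopencl: device memory first, then host buffer
    return cl.enqueue_read_buffer(queue, device_buf, host_array)


if __name__ == "__main__":
    # Stand-in module that only records the call, for demonstration.
    fake_cl = SimpleNamespace(
        enqueue_copy=lambda q, dest, src: print("enqueue_copy", dest, src))
    read_device_buffer(fake_cl, None, "host_array", "device_buf")
```

Note the argument order flips between the two calls, which is the easy thing to get wrong when updating a kernel's `__init__.py`.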


Using pyOpenCL 2011.1.2 with the kernel in its current form gets me less mhash though:
SDK 2.4: 307.98
SDK 2.5: 307.84

(5830@955/350; Catalyst 11.6; Win7; py 2.6.6)

CYPER
Hero Member
*****
Offline Offline

Activity: 630



View Profile
August 12, 2011, 03:24:26 AM
 #249

Using the latest 2.2 version got quite a noticeable increase:

Before:
4x 440Mh/s = 1760Mh/s

After:
4x 446Mh/s = 1784Mh/s

My best settings are:
Worksize = 256
Aggression = 12
VECTORS

If this post helped you and you feel generous you know what to do: 1P9tXFy9bVgzrfPGeV7F8np26ZtFdCCWvz
Tx2000
Full Member
***
Offline Offline

Activity: 182



View Profile
August 12, 2011, 03:46:11 AM
 #250



What I am currently working on is a modified version of phoenix which runs multiple kernels with a single instance and a single work queue (to decrease excessive getwork).
I am also working on plugin support for it, so you can use various added features (such as built-in gui, Web interface, logger, autotune, variable aggression for when computer is idle, overclocking support, etc...)
This would make it tremendously easier for anyone to add features and you can still use whichever kernel works best for you.

As for cgminer support, I haven't tried it, are there any benefits over phoenix?  I may fork that instead of phoenix and make the plugin support via command-line, lua or javascript, although I find that python is much easier to code than c (especially for cross platform support).

Would definitely be interested in a cgminer fork.  Don't get me wrong, phoenix is great and has always given me the best performance overall but it does lack some of the more refined features, which the other poster listed above.  Failover and nice static but updated command line "UI".  Seems like you and diapolo are hitting the ceiling with phoenix anyway.
hugolp
Hero Member
*****
Offline Offline

Activity: 742



View Profile
August 12, 2011, 06:54:36 AM
 #251

There is one thing I don't understand about the results of these modifications. They increase the hash rate but they also increase consumption, and I always thought that since they make the kernel more efficient (same task with fewer instructions, less work for the GPU per hash) they should increase the hash rate without changing consumption too much. Does anyone know why the more efficient kernel is not also more energy efficient?

Also, if one of you guys is out of ideas for making the cards run faster, it could be interesting to target energy efficiency instead of speed. A lot of us are not interested in running our cards at the maximum MHash/s rate but are more interested in having a better MHash/J rate.
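For reference, MHash/J is simply the hash rate divided by the power draw, since a watt is one joule per second. A small sketch with purely hypothetical numbers (none of these wattages or rates come from this thread):

```python
def mhash_per_joule(mhash_per_sec, watts):
    """Energy efficiency: (MHash/s) / (J/s) = MHash/J."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return mhash_per_sec / watts


if __name__ == "__main__":
    # Hypothetical card at full speed vs. the same card tuned for efficiency.
    full_speed = mhash_per_joule(400.0, 200.0)   # 400 MH/s at 200 W
    tuned = mhash_per_joule(360.0, 150.0)        # 360 MH/s at 150 W
    print("full speed: %.2f MHash/J" % full_speed)
    print("tuned:      %.2f MHash/J" % tuned)
```

In this made-up case the slower configuration does more hashing per joule, which is the trade-off being asked about.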

talldude
Member
**
Offline Offline

Activity: 74


View Profile
August 12, 2011, 01:23:02 PM
 #252

It is more efficient - the more output per unit time you have, the more efficient it is since the card will be wasting less power sitting idle.

If you want to increase efficiency, that is a hardware thing - namely undervolt your card.
bcforum
Full Member
***
Offline Offline

Activity: 140


View Profile
August 12, 2011, 01:28:56 PM
 #253

There is one thing I don't understand about the results of these modifications. They increase the hash rate but they also increase consumption, and I always thought that since they make the kernel more efficient (same task with fewer instructions, less work for the GPU per hash) they should increase the hash rate without changing consumption too much. Does anyone know why the more efficient kernel is not also more energy efficient?

Also, if one of you guys is out of ideas for making the cards run faster, it could be interesting to target energy efficiency instead of speed. A lot of us are not interested in running our cards at the maximum MHash/s rate but are more interested in having a better MHash/J rate.


In theory, fewer ALU ops translate to less energy consumption. In practice, each ALU op uses a slightly different amount of power, and a kernel which executes 10x instruction A may burn more power than one which executes 12x instruction B. Unfortunately, instruction power numbers aren't documented anywhere, so it is almost impossible to optimize in a theoretical sense, and the numbers could vary from GPU to GPU (due to minor manufacturing defects).

One of Diapolo's recent kernels lowered operating temperature by ~3C without changing hashrate significantly. Presumably that particular kernel is ~10% more power efficient than others.

If you found this post useful, feel free to share the wealth: 1E35gTBmJzPNJ3v72DX4wu4YtvHTWqNRbM
hugolp
Hero Member
*****
Offline Offline

Activity: 742



View Profile
August 12, 2011, 01:35:19 PM
 #254

In theory, fewer ALU ops translate to less energy consumption. In practice, each ALU op uses a slightly different amount of power, and a kernel which executes 10x instruction A may burn more power than one which executes 12x instruction B. Unfortunately, instruction power numbers aren't documented anywhere, so it is almost impossible to optimize in a theoretical sense, and the numbers could vary from GPU to GPU (due to minor manufacturing defects).

One of Diapolo's recent kernels lowered operating temperature by ~3C without changing hashrate significantly. Presumably that particular kernel is ~10% more power efficient than others.

Thanks for the answer. Can you indicate the version of Diapolo's kernel you are referring to?
Phateus
Jr. Member
*
Offline Offline

Activity: 52


View Profile
August 12, 2011, 05:53:22 PM
 #255



What I am currently working on is a modified version of phoenix which runs multiple kernels with a single instance and a single work queue (to decrease excessive getwork).
I am also working on plugin support for it, so you can use various added features (such as built-in gui, Web interface, logger, autotune, variable aggression for when computer is idle, overclocking support, etc...)
This would make it tremendously easier for anyone to add features and you can still use whichever kernel works best for you.

As for cgminer support, I haven't tried it, are there any benefits over phoenix?  I may fork that instead of phoenix and make the plugin support via command-line, lua or javascript, although I find that python is much easier to code than c (especially for cross platform support).

Would definitely be interested in a cgminer fork.  Don't get me wrong, phoenix is great and has always given me the best performance overall but it does lack some of the more refined features, which the other poster listed above.  Failover and nice static but updated command line "UI".  Seems like you and diapolo are hitting the ceiling with phoenix anyway.

I will release a version that will work with cgminer early next week (looks like he has already implemented diapolo's old version).

We are hitting a ceiling with opencl in general (and perhaps with the current hardware).  In one of the mining threads, vector76 and I were discussing the theoretical limit on hashing speeds... and unless there is a way to make the Maj() operation take 1 instruction, we are within about a percent of the theoretical limit on minimum number of instructions in the kernel unless we are missing something.
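For readers unfamiliar with it, Maj() is the SHA-256 majority function. The sketch below checks a widely known reformulation of it as one XOR feeding one bit-select, next to Ch(), which fits in a single bit-select. `bitselect` here mirrors the OpenCL built-in of the same name; whether phatk uses exactly these forms is an assumption, not something confirmed in this thread.

```python
import random

MASK = 0xFFFFFFFF  # SHA-256 works on 32-bit words


def bitselect(a, b, c):
    """OpenCL-style bitselect: take the bit from b where c is 1, else from a."""
    return ((a & ~c) | (b & c)) & MASK


def maj_reference(x, y, z):
    """Majority as written in the SHA-256 definition."""
    return (x & y) ^ (x & z) ^ (y & z)


def maj_two_ops(x, y, z):
    """Majority as one XOR plus one bit-select."""
    return bitselect(x, y, x ^ z)


def ch_reference(x, y, z):
    """Choose as written in the SHA-256 definition."""
    return ((x & y) ^ (~x & z)) & MASK


def ch_one_op(x, y, z):
    """Choose as a single bit-select."""
    return bitselect(z, y, x)


if __name__ == "__main__":
    for _ in range(10000):
        x, y, z = (random.getrandbits(32) for _ in range(3))
        assert maj_two_ops(x, y, z) == maj_reference(x, y, z)
        assert ch_one_op(x, y, z) == ch_reference(x, y, z)
    print("identities hold on random 32-bit inputs")
```

Where x and z agree, the majority is x; where they differ, y decides. That is why Maj() costs two operations in this form, while Ch() needs only one.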

Now that doesn't mean that there is NO room for improvement, just that any other improvement will probably have to be faster hardware, a more efficient implementation of openCL by AMD or figuring out a better way to finagle the current openCL implementation to reduce the implementation overhead.  But, unless there is a problem with pyopenCL, c and python should give equivalent speeds as long as they are just calling the openCL interface (the actual miner uses negligible resources).  I suppose it could be possible to access the hardware drivers directly and run the kernel that way... but I don't see that as being feasible.

But, with all of that said, I have looked through some of his code, and it is some really clean code.  Part of the reason I want to add these features is to learn more python (this is the first thing I have programmed in python), but it will probably just be easier to modify the cgminer code.  Thanks for pointing out cgminer to me :)

Tx2000
Full Member
***
Offline Offline

Activity: 182



View Profile
August 12, 2011, 06:04:56 PM
 #256

Sent another donation your way.  Look forward to your work on cgminer.
Phateus
Jr. Member
*
Offline Offline

Activity: 52


View Profile
August 12, 2011, 06:30:17 PM
 #257

Sent another donation your way.  Look forward to your work on cgminer.

Thanks :D

bcforum
Full Member
***
Offline Offline

Activity: 140


View Profile
August 12, 2011, 06:50:47 PM
 #258

Thanks for the answer. Can you indicate the version of Diapolo's kernel you are referring to?

https://bitcointalk.org/index.php?topic=25860.msg428882#msg428882

BOARBEAR
Member
**
Offline Offline

Activity: 77


View Profile
August 12, 2011, 07:38:31 PM
 #259

I took a look at the comparison between version 2.2 and version 2.1.
Could it be the __constant uint ConstW[128] change that broke VECTORS4?
Phateus
Jr. Member
*
Offline Offline

Activity: 52


View Profile
August 12, 2011, 08:01:13 PM
 #260

I took a look at the comparison between version 2.2 and version 2.1.
Could it be the __constant uint ConstW[128] change that broke VECTORS4?

That change is inconsequential (I was trying some things that required the change but did not keep them)... the compiler doesn't use those values, so the code should be exactly the same either way (you can try replacing the code with the old code if you want to check).

You keep saying that it is broken.. if it does not run, post the errors.

I have found that on my card, VECTORS4 is much slower in version 2.2 than 2.1, but this is not a bug... it seems to be because openCL does not like allocating that many registers... Version 2.1 uses around 99.7% of instruction slots with VECTORS4 and I have tried many many ways to make it faster and more reliable (in 2.1), but I have given up on it.  It is still in the release because I don't see any point in taking it out...  but getting 2.2 to run as fast as 2.1 with VECTORS4 is not going to happen.  Also, the differences between 2.1 and 2.2 with VECTORS are very tiny anyway (less than .5%)...

Getting into more detail about it: If you look at the graph on the main page of the thread, you can see the graph of VECTORS4 in version 2.1... in version 2.2 for some reason, the spike (and corresponding valley) is located higher (somewhere around 500), this could mean that it would be just as fast if you had 1500 Mhz memory, but I have no idea why openCL reacts this way to changing the memory speed.  There are way too many GPU architecture/GPU bios/PCIe bus/CPU-GPU transfer/driver/openCL implementation unknowns to try to predict this behavior.


-Phateus
