Bitcoin Forum
April 30, 2017, 09:12:52 AM *
Author Topic: [ANN][GRS][DMD] Pallas optimized groestlcoin / diamond etc. opencl kernel  (Read 50082 times)
pallas
March 05, 2015, 08:56:35 AM
 #281


Pallas,

Are you planning on adding myriad-groestl support in the future? If not, could you explain why not? Is it because your groestl kernel is already faster than the myriad-groestl?

Also, are you planning on putting your work on github? Again, if not, could you explain why not?

It seems to me that both are important ways to further your efforts and establish your reputation.

Best regards as always.

HR

Myr-Groestl must do SHA256 as well, IIRC - of course pure Groestl is faster.

myr-groestl should be faster because it has a single round of groestl (14 iterations) + sha; groestlcoin is groestl + groestl again, so slower.
It's just that I do not have enough free time to work on all these algos...
Now wolf0 just did a fantastic job on whirlpoolx and I want to understand the magic ;-)

Haha, you ain't seen impressive yet! Check the thread, I'm about to post again!

OMG, this means a lot less reading and TV this week for me LoL!

smolen
March 06, 2015, 04:26:47 AM
 #282

I have groestl code from smelter (first GPU miner for quark). Maybe it has some tricks for your work. It was rather fast on the Radeon HD 5xxx series.
Today my code is obsolete; it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code.

Of course I gave you bad advice. Good one is way out of your price range.
pallas
March 06, 2015, 09:04:19 AM
 #283

I'd really like to see that implemented in sgminer, but I'm not sure it'll be faster: NVIDIA is a very different architecture.
On a side note, interest in mining groestlcoin-based PoW coins is fading because the only coin with enough volume is switching to 1/10 of the reward soon; the others are dying, dead, or paying very little anyway.

Wolf0
March 06, 2015, 07:25:15 PM
 #284

It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.

Code:
Donations: BTC: 1WoLFdwcfNEg64fTYsX1P25KUzzSjtEZC -- XMR: 45SLUTzk7UXYHmzJ7bFN6FPfzTusdUVAZjPRgmEDw7G3SeimWM2kCdnDQXwDBYGUWaBtZNgjYtEYA22aMQT4t8KfU3vHLHG
smolen
March 07, 2015, 03:55:11 AM
 #285

Yes, transpose, do bitsliced calculation and transpose back, that will work. Does GCN have something like PMOVMSKB?
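
To make the transpose idea concrete, here is a minimal Python sketch (not code from any miner in this thread). It emulates what PMOVMSKB does on x86, uses it to split 16 bytes into 8 bit-planes, runs one bitsliced calculation, and transposes back. The calculation shown is multiplication by 2 in GF(2^8) with the AES polynomial, picked only because it is short; a bitsliced S-box would follow the same pattern with a much longer boolean circuit. All function names are made up for illustration.

```python
def pmovmskb(block):
    """Emulate x86 PMOVMSKB: collect the top bit of each of 16 bytes into a 16-bit mask."""
    assert len(block) == 16
    mask = 0
    for i, b in enumerate(block):
        mask |= (b >> 7) << i
    return mask


def to_bitplanes(block):
    """Transpose 16 bytes into 8 bit-planes; planes[k] holds bit k of every byte."""
    planes = [0] * 8
    cur = list(block)
    for k in range(7, -1, -1):
        planes[k] = pmovmskb(cur)              # grab the current top bit of all bytes
        cur = [(b << 1) & 0xFF for b in cur]   # move the next bit into the top position
    return planes


def from_bitplanes(planes):
    """Transpose 8 bit-planes back into 16 bytes."""
    return bytes(
        sum(((planes[k] >> i) & 1) << k for k in range(8))
        for i in range(16)
    )


def xtime_bitsliced(p):
    """Multiply all 16 bytes by 2 in GF(2^8) mod x^8+x^4+x^3+x+1, one plane-wide op per bit."""
    hi = p[7]  # bytes whose top bit is set need the 0x1B reduction
    return [hi, p[0] ^ hi, p[1], p[2] ^ hi, p[3] ^ hi, p[4], p[5], p[6]]


def xtime_bytewise(b):
    """Reference: the same multiplication done one byte at a time."""
    return ((b << 1) ^ (0x1B if b & 0x80 else 0)) & 0xFF
```

Without an instruction like PMOVMSKB, each call to `pmovmskb` here would have to be built from shifts and masks (or a cross-lane operation), which is the cost the question above is getting at.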

Wolf0
March 07, 2015, 11:30:45 PM
 #286

I don't think so.

pallas
March 09, 2015, 08:45:42 AM
 #287

It'd be better to just bitslice the S-box, I think, since we don't have warp shuffle.

I've seen what you achieved on whirlpoolx: assuming a similar improvement can be made on groestl as well, that would mean more than 80 MH/s.
Now, since the time I can dedicate to such a project is a few minutes a day, it would take months. Volunteers? :-)

Wolf0
March 16, 2015, 12:26:56 PM
 #288

Hey - a few hours ago, I remembered your OpenCL frustrations with 14.9 and above, and decided to take a look at your Groestlcoin code again. Fixed it up just a little bit, and while the resulting binaries aren't quite as good as the ones using GCN assembly, they outperform the original OpenCL on its intended driver.

Stock Pallas' OpenCL, available in the OP, used with 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinpallas-03162015.png
Modified version of that OpenCL, used with the same 14.12 drivers (NSFW): https://ottrbutt.com/miner/groestlcoinwolf-03162015.png

I'm not running old drivers on any rig right now, and I don't intend to change that in the near future, so comparing my numbers to the numbers in the OP, 290X goes from 26.4MH/s to 29.11MH/s - substantial.
Other cards, as well as clocks and such are in the screenshot. Oh, and I know memclock doesn't matter here, but I set it to 1500 by force of habit.
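
For anyone who wants to put a percentage on those two 290X numbers, a trivial Python sketch (the helper name is made up; the two figures are the ones quoted above):

```python
def speedup_percent(old_mhs, new_mhs):
    """Relative hashrate gain in percent."""
    return (new_mhs - old_mhs) / old_mhs * 100.0


# 290X figures from the post: 26.4 MH/s (OP) vs 29.11 MH/s (modified OpenCL)
gain_290x = speedup_percent(26.4, 29.11)
```

That works out to a bit over 10%.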

pallas
March 16, 2015, 12:46:38 PM
 #289

Thanks Wolf0, but I already got over 34 MH/s; see the OP (experimental v2, binary posted some posts ago).
It's 2-3% faster than the asm version.
It's only for Hawaii and 14.12, though; 14.9 is damned!
The next step is bitslicing, but I do not have the time to work on it ;-)

Wolf0
March 16, 2015, 12:53:20 PM
 #290

As I said - I noticed. However, notice the 280X speeds? You haven't been able to create binaries that good for any chip but Hawaii, AFAIK.

pallas
March 16, 2015, 01:12:34 PM
 #291

I do not have the card so I can't test it, but I know that on Hawaii it can use two wavefronts, but only one on Tahiti.
Does your kernel run 2 wavefronts on Tahiti, as the asm version does?

Wolf0
March 16, 2015, 01:43:54 PM
 #292

Mine's got 2 waves in flight on Hawaii - I believe editing to get another wave in flight on Tahiti and Pitcairn should be simple, stand by.

iju76
March 16, 2015, 01:44:41 PM
 #293

win7-64 -- sgminer-5-dev-neoscrypt-windows-new2 -- dr-14.7

http://s001.radikal.ru/i194/1503/f3/09a2627a6270.png
Wolf0
March 16, 2015, 03:30:25 PM
 #294

Hmm... it's one hell of a lot harder to lose two goddamned VGPRs than I thought it'd be.
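
Why two registers can matter so much: a rough Python sketch of the occupancy arithmetic, assuming the GCN limits as I understand them from AMD's public documentation (256 VGPRs per SIMD lane, allocated in blocks of 4, at most 10 waves per SIMD). The function name and the 130-register example are hypothetical; they are not measured from the kernel discussed here.

```python
def waves_per_simd(vgprs, total_vgprs=256, granularity=4, max_waves=10):
    """Waves that fit on one SIMD when limited only by VGPR usage (assumed GCN limits)."""
    allocated = -(-vgprs // granularity) * granularity  # round up to the allocation block
    return min(max_waves, total_vgprs // allocated)


# Hypothetical kernel at 130 VGPRs: only one wave fits.
one_wave = waves_per_simd(130)
# Shaving it down to 128 VGPRs doubles the waves in flight.
two_waves = waves_per_simd(128)
```

Under those assumptions, 128 is a hard threshold: anything above it gets one wave, anything at or below gets two.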

realhet
March 16, 2015, 03:33:40 PM
 #295

I have groestl code from smelter (first GPU miner for quark). Maybe it has some tricks for your work. It was rather fast on the Radeon HD 5xxx series.
Today my code is obsolete; it has already been discussed in this thread. BTW, another source of tricks is cbuchner1's bitsliced and byteshuffled code.
Hi,

Out of curiosity I really had to check that bitsliced code :D and well... I must say that NV has better instructions for it:
__byte_perm(x, 0, 1010)>>s:  this could be emulated by an AND, a MAD24 and a SHR. 3 cycles instead of 2.
__byte_perm(x, 0, 3232)>>s:  SHR, MAD24, SHR   also 3 instead of 2.
__byte_perm(x, y, 5410)      :  SHL, BFE      2 instead of 1 instr.  (Even Intel SSE has had many instructions for these things for ages :S)
And there are lots of bitwise logical instructions where NV is 2x faster because NV has a 3-operand logic instruction with all the possible 16*16 logic operator combinations.
There is also shuffling between 4 lanes: that is not a problem on GCN with ds_swizzle, otherwise it needs LDS on OpenCL.
I've just checked the GCN 1.3 ISA manual and (at least there) I haven't found a byte_swizzle or any 3-operand logic instructions either.

Anyway, it would be interesting to see how this totally different approach performs compared to the table-based one.
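
The three __byte_perm patterns listed above can be checked against plain mask/multiply/shift forms with a small Python sketch. It assumes the selectors are hexadecimal (0x1010, 0x3232, 0x5410) and follows the documented behaviour of CUDA's __byte_perm, where each selector nibble picks one of the 8 bytes of x (0-3) and y (4-7). The helper names are made up, and the replacements shown are one possible formulation, not necessarily the exact instruction sequence meant in the post.

```python
MASK32 = 0xFFFFFFFF


def byte_perm(x, y, s):
    """Emulate CUDA __byte_perm: build a 32-bit word from 4 selected bytes of x and y."""
    pool = [(x >> (8 * i)) & 0xFF for i in range(4)]
    pool += [(y >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for i in range(4):
        sel = (s >> (4 * i)) & 0x7      # only the low 3 bits of each nibble are used
        out |= pool[sel] << (8 * i)
    return out


def dup_low16(x):
    """0x1010 pattern: AND, then a multiply that fits in 24-bit operands."""
    return ((x & 0xFFFF) * 0x10001) & MASK32


def dup_high16(x):
    """0x3232 pattern: shift right, then the same multiply."""
    return ((x >> 16) * 0x10001) & MASK32


def pack_low16(x, y):
    """0x5410 pattern: low half of x in the low 16 bits, low half of y above it."""
    return (x & 0xFFFF) | ((y << 16) & MASK32)
```

The trailing `>>s` in the first two patterns is then an ordinary shift on the result.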
Wolf0
March 16, 2015, 03:35:54 PM
 #296

Well, I've done Whirlpool-512 with no lookups at all, and it kinda sucks on GPU. It'll probably be a beast on FPGA, though!

smolen
March 16, 2015, 07:47:11 PM
 #297

Yes, transpose, do bitsliced calculation and transpose back, that will work. Does GCN have something like PMOVMSKB?
I don't think so.
Maybe VCC (vector condition code) will do the trick, so normal and bitsliced operations could be cheaply interleaved.

NV has a 3 op logic instruction with all the possible 16*16 logic operator combinations.
I've just checked the GCN 1.3 ISA manual and (at least there) I haven't found byte_swizzle and no 3 operand logic instructions either.
Yes, AMD's GCN is outplayed by VPTERNLOGD and VPTERNLOGQ from Intel AVX-512 and by LOP3.LUT from NVIDIA :(
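
For readers who have not met these instructions: they take three inputs plus an 8-bit truth table and produce any boolean function of the three in one go. A minimal Python emulation, assuming the usual convention (as I understand NVIDIA's PTX documentation) that the table is obtained by evaluating the function on the constants 0xF0, 0xCC and 0xAA. The names `lop3` and `lut_of` are just for this sketch.

```python
def lut_of(f):
    """Truth-table byte for a 3-input boolean function f(a, b, c)."""
    return f(0xF0, 0xCC, 0xAA) & 0xFF


def lop3(a, b, c, lut, width=32):
    """Apply the 3-input function described by lut to every bit position of a, b, c."""
    mask = (1 << width) - 1
    out = 0
    for idx in range(8):
        if (lut >> idx) & 1:
            ta = a if idx & 4 else ~a   # bit 2 of the table index is the a input
            tb = b if idx & 2 else ~b   # bit 1 is the b input
            tc = c if idx & 1 else ~c   # bit 0 is the c input
            out |= ta & tb & tc
    return out & mask


XOR3 = lut_of(lambda a, b, c: a ^ b ^ c)          # three-way XOR
AND_OR = lut_of(lambda a, b, c: (a & b) | c)      # (a AND b) OR c
```

With only two-operand logic instructions, each of these takes two operations instead of one, which is where the 2x figure for such sequences comes from.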

smolen
March 16, 2015, 07:54:01 PM
 #298

Hmm... it's one hell of a lot harder to lose two goddamned VGPRs than I thought it'd be.
Have you rotated table values left by 3 bits? ;) Not sure it will help with register usage though...

Wolf0
March 16, 2015, 07:59:02 PM
 #299

Rotations seem to hurt reg usage a bit. The source REALLY needs cleaning, but IMO, it's rather well done code by Pallas. I'm not really used to seeing anyone with a semblance of clue doing AMD miners. :P

pallas
March 16, 2015, 09:02:43 PM
 #300

Now I've put some parts of the code (e.g. the list of rbtts) into for loops with pragma unroll, and it looks much better ;-)
