Bitcoin Forum
Author Topic: AMDGPU-Pro 17.40 with large page support  (Read 931 times)
nerdralph (Sr. Member) | November 05, 2017, 03:04:33 PM | #1
http://support.amd.com/en-us/kb-articles/Pages/AMDGPU-PRO-Driver-for-Linux-Release-Notes.aspx
Anyone test it out yet on a card with a custom BIOS?

I was confused by the comment about page size, since the AMD GCN Whitepaper claims "Like the L1 data cache, the L2 is virtually addressed, so no TLBs are required at all" on pg 10.
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

However variable page size for a cache controller suggests a TLB, and comments on Phoronix seem to confirm this:
https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/977778-amdgpu-increasing-fragment-size-for-performance

I'm guessing this is just another example of why the GCN docs can't be relied on 100%; to find out what's really going on, you have to go through the driver code and do your own tweaks and testing.
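As a back-of-envelope illustration of why fragment/page size matters here: the number of page-table entries needed to map a multi-gigabyte DAG drops enormously with 2 MB pages. The DAG size below is an assumption for the sketch, not a measured figure.

```python
# Back-of-envelope TLB pressure for a large ethash DAG.
# The DAG size is an illustrative assumption (it grows per epoch).
DAG_BYTES = 2_560 * 1024 * 1024          # assume ~2.5 GB DAG

def pages_needed(total_bytes, page_bytes):
    """Page-table entries needed to map the whole buffer."""
    return (total_bytes + page_bytes - 1) // page_bytes

small = pages_needed(DAG_BYTES, 4 * 1024)         # 4 KB pages
large = pages_needed(DAG_BYTES, 2 * 1024 * 1024)  # 2 MB pages

print(small)  # 655360 entries at 4 KB
print(large)  # 1280 entries at 2 MB
```

With only ~1280 entries to cover the whole DAG, a modest TLB can hold the working set; at 4 KB granularity it cannot come close.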
bridgman (Newbie) | November 05, 2017, 09:22:26 PM | #2

Quote
I was confused by the comment about page size, since the AMD GCN Whitepaper claims "Like the L1 data cache, the L2 is virtually addressed, so no TLBs are required at all" on pg 10.
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

That comment refers to not requiring TLBs in order to access the tag RAM and see if there is a cache hit, so requests that hit in cache are a bit faster. If the access misses in cache, then the request has to go through page tables (accelerated via TLBs) in order to access the correct memory location.
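Bridgman's distinction (virtual tags make hits translation-free, while misses still pay for a page walk) can be sketched as a toy model. This is purely conceptual, not a model of AMD's actual hardware.

```python
# Toy model of a virtually addressed cache: tag lookup needs no address
# translation, so only misses incur a page-table walk / TLB access.
# Conceptual sketch only; line size and identity mapping are assumptions.

class ToyCache:
    def __init__(self):
        self.lines = {}   # virtual line index -> data
        self.walks = 0    # page-table walks performed on misses

    def translate(self, vaddr):
        self.walks += 1   # stand-in for a TLB lookup / page walk
        return vaddr      # identity mapping, for the sketch

    def load(self, vaddr):
        line = vaddr // 64
        if line in self.lines:         # hit: virtual tags, no translation
            return self.lines[line]
        paddr = self.translate(vaddr)  # miss: translate, fetch from "DRAM"
        self.lines[line] = paddr
        return paddr

c = ToyCache()
c.load(0x1000); c.load(0x1000); c.load(0x1000)
print(c.walks)  # 1 -- only the first, missing access needed translation
```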
nerdralph (Sr. Member) | November 06, 2017, 07:18:08 PM | #3

Quote
I was confused by the comment about page size, since the AMD GCN Whitepaper claims "Like the L1 data cache, the L2 is virtually addressed, so no TLBs are required at all" on pg 10.
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

That comment refers to not requiring TLBs in order to access the tag RAM and see if there is a cache hit, so requests that hit in cache are a bit faster. If the access misses in cache, then the request has to go through page tables (accelerated via TLBs) in order to access the correct memory location.

Thanks for the explanation.  Would the large page support be irrelevant for GCN assembler code that loads uncacheable data like the ethash DAG using FLAT_LOAD_DWORD with the SLC and GLC bits set?  By bypassing L1 & L2, that should mean skipping the access to the tag RAM and the page tables.  I haven't actually tried it myself yet, but comments by Wolf0 and zawawa seemed to say it was actually a bit slower.
Wolf0 (Legendary, Miner Developer) | November 07, 2017, 03:39:45 AM | #4

Quote
I was confused by the comment about page size, since the AMD GCN Whitepaper claims "Like the L1 data cache, the L2 is virtually addressed, so no TLBs are required at all" on pg 10.
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

That comment refers to not requiring TLBs in order to access the tag RAM and see if there is a cache hit, so requests that hit in cache are a bit faster. If the access misses in cache, then the request has to go through page tables (accelerated via TLBs) in order to access the correct memory location.

Thanks for the explanation.  Would the large page support be irrelevant for GCN assembler code that loads uncacheable data like the ethash DAG using FLAT_LOAD_DWORD with the SLC and GLC bits set?  By bypassing L1 & L2, that should mean skipping the access to the tag RAM and the page tables.  I haven't actually tried it myself yet, but comments by Wolf0 and zawawa seemed to say it was actually a bit slower.


Slower indeed. GLC alone is best. SLC alone, as well as SLC + GLC are worse than only GLC.

By the way, I finally took a look at why Claymore's "ASM" kernel for Ellesmere is so ass: get this - he didn't actually do it properly, as in the whole thing in ASM, as he (arguably) implies. It's the output of the AMD OpenCL compiler (using the "-legacy" switch to make it use the older version), then tweaked a bit. He even uses LDS... wtf. He DOES take advantage of ds_swizzle_b32, but he's still fuckin' wasting a lot of local mem writes + some reads, and while ds_swizzle_b32 is (IIRC) full-rate, he's still wasting 4 clocks apiece when it could be done better. Additionally - this made me laugh IRL - his v10.0 miner looks for a kernel that does not exist. I was actually hella confused for a minute - double and triple checked the decoded GCN kernel binary (recovered from the memory of the miner process) - but, sure enough, it wasn't there. So I finally figured, let's check the return value of clCreateKernel()... sure enough, it returns CL_INVALID_KERNEL_NAME (-46). Apparently the miner (obviously) finds this error to be non-fatal and continues... but why the bloody fuck is it IN there?
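The failure mode described here is easy to miss if host code never checks clCreateKernel's errcode_ret. A minimal sketch of the check follows; the error constants match cl.h, but `create_kernel` and the kernel names are stand-ins for illustration, not the real OpenCL API or Claymore's actual binary.

```python
# Sketch of checking the clCreateKernel error path described above.
# CL_SUCCESS / CL_INVALID_KERNEL_NAME match cl.h; create_kernel and the
# kernel names are hypothetical stand-ins for the real OpenCL call.
CL_SUCCESS = 0
CL_INVALID_KERNEL_NAME = -46

def create_kernel(program_kernels, name):
    """Fake clCreateKernel: returns (kernel_handle, errcode)."""
    if name in program_kernels:
        return (name, CL_SUCCESS)
    return (None, CL_INVALID_KERNEL_NAME)

kernels_in_binary = {"search", "GenerateDAG"}      # assumed names
kernel, err = create_kernel(kernels_in_binary, "ethash_v10")
if err != CL_SUCCESS:
    # Ignoring err here silently leaves a dead kernel handle behind.
    print("clCreateKernel failed:", err)  # prints: clCreateKernel failed: -46
```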

stash2coin (Jr. Member) | November 07, 2017, 07:48:39 AM | #5

Could be some test kernel that he removes before public release, without bothering to remove the reference to it since it isn't causing problems. One more funny thing: his ZEC miner looks for Nvidia libraries, but the miner is only for AMD cards :)
nerdralph (Sr. Member) | November 07, 2017, 06:16:09 PM | #6

Quote
I was confused by the comment about page size, since the AMD GCN Whitepaper claims "Like the L1 data cache, the L2 is virtually addressed, so no TLBs are required at all" on pg 10.
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

That comment refers to not requiring TLBs in order to access the tag RAM and see if there is a cache hit, so requests that hit in cache are a bit faster. If the access misses in cache, then the request has to go through page tables (accelerated via TLBs) in order to access the correct memory location.

Thanks for the explanation.  Would the large page support be irrelevant for GCN assembler code that loads uncacheable data like the ethash DAG using FLAT_LOAD_DWORD with the SLC and GLC bits set?  By bypassing L1 & L2, that should mean skipping the access to the tag RAM and the page tables.  I haven't actually tried it myself yet, but comments by Wolf0 and zawawa seemed to say it was actually a bit slower.


Slower indeed. GLC alone is best. SLC alone, as well as SLC + GLC are worse than only GLC.

By the way, I finally took a look at why Claymore's "ASM" kernel for Ellesmere is so ass: Get this - he didn't actually do it properly, as in the whole thing in ASM, as he (arguably) implies. It's the output of the AMD OpenCL compiler (using the "-legacy" switch to make it use the older version) and then tweaked a bit. He even uses LDS... wtf. He DOES take advantage of ds_swizzle_b32, but he's still fuckin' wasting a lot of local mem writes + some reads, and the ds_swizzle_b32 is (IIRC) full-rate, but even so, he's wasting 4 clocks per when it could be done better. Additionally - this made me laugh IRL - his v10.0 miner looks for a kernel that does not exist. I was actually hella confused for a minute - double and triple checked the decoded GCN kernel binary (recovered from the memory of the miner process) - but, sure enough, it wasn't there. So, I finally figured, let's check the return value of clCreateKernel()... sure enough, it returns CL_INVALID_KERNEL_NAME (-46.) Apparently the miner (obviously) finds this error to be non-fatal and continues... but why the bloody fuck is it IN there? Huh

Thanks for confirming.  So bypassing L1 is faster, but bypassing L2 is slower.  Yet speeding up L2 cache misses with larger pages is faster...
According to the GCN architecture docs, L2 does more than just cache; it also handles things such as global atomics.  I now suspect it is also involved in queueing and arbitration for DRAM access from the CUs, which would explain the slowdown using SLC.

As for Claymore, ever since I first looked at his equihash kernels, I've thought he's a hack.  He seems to get most of his ideas from other people, rather than trying to fully understand the GPU hardware and OpenCL compiler.  Tweaking the compiler's ISA output is something zawawa was discussing early this year (in addition to LLVM work to get inline asm working), so I suspect Claymore just got the idea from reading zawawa's posts.
Wolf0 (Legendary, Miner Developer) | December 03, 2017, 06:30:59 PM | #7

Quote
I was confused by the comment about page size, since the AMD GCN Whitepaper claims "Like the L1 data cache, the L2 is virtually addressed, so no TLBs are required at all" on pg 10.
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

That comment refers to not requiring TLBs in order to access the tag RAM and see if there is a cache hit, so requests that hit in cache are a bit faster. If the access misses in cache, then the request has to go through page tables (accelerated via TLBs) in order to access the correct memory location.

Thanks for the explanation.  Would the large page support be irrelevant for GCN assembler code that loads uncacheable data like the ethash DAG using FLAT_LOAD_DWORD with the SLC and GLC bits set?  By bypassing L1 & L2, that should mean skipping the access to the tag RAM and the page tables.  I haven't actually tried it myself yet, but comments by Wolf0 and zawawa seemed to say it was actually a bit slower.


Slower indeed. GLC alone is best. SLC alone, as well as SLC + GLC are worse than only GLC.

By the way, I finally took a look at why Claymore's "ASM" kernel for Ellesmere is so ass: Get this - he didn't actually do it properly, as in the whole thing in ASM, as he (arguably) implies. It's the output of the AMD OpenCL compiler (using the "-legacy" switch to make it use the older version) and then tweaked a bit. He even uses LDS... wtf. He DOES take advantage of ds_swizzle_b32, but he's still fuckin' wasting a lot of local mem writes + some reads, and the ds_swizzle_b32 is (IIRC) full-rate, but even so, he's wasting 4 clocks per when it could be done better. Additionally - this made me laugh IRL - his v10.0 miner looks for a kernel that does not exist. I was actually hella confused for a minute - double and triple checked the decoded GCN kernel binary (recovered from the memory of the miner process) - but, sure enough, it wasn't there. So, I finally figured, let's check the return value of clCreateKernel()... sure enough, it returns CL_INVALID_KERNEL_NAME (-46.) Apparently the miner (obviously) finds this error to be non-fatal and continues... but why the bloody fuck is it IN there? Huh

Corrective comment removed at Claymore's request.

nerdralph (Sr. Member) | December 05, 2017, 03:38:46 AM | #8

I just noticed that although mining with a >2GB DAG is faster with 2M page size, DAG creation is much slower (3-4x longer).
bridgman (Newbie) | December 10, 2017, 01:54:07 PM | #9

Quote
Thanks for the explanation.  Would the large page support be irrelevant for GCN assembler code that loads uncacheable data like the ethash DAG using FLAT_LOAD_DWORD with the SLC and GLC bits set?  By bypassing L1 & L2, that should mean skipping the access to the tag RAM and the page tables.  I haven't actually tried it myself yet, but comments by Wolf0 and zawawa seemed to say it was actually a bit slower.

The large page support should still help if L1/L2 caches are being bypassed - if anything it would help more. Bypassing L1/L2 cache skips access to the cache's tag rams but does not skip access to the page tables, so the reduced TLB thrashing from large page support should still be relevant.
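The point that uncached accesses benefit even more can be sketched numerically: random DAG reads touch nearly as many distinct 4 KB pages as there are reads, but far fewer distinct 2 MB pages, so a fixed-size TLB thrashes much less. Sizes and read counts below are illustrative assumptions.

```python
# Sketch: random reads over an assumed ~2 GB DAG touch far fewer
# distinct 2 MB pages than 4 KB pages, so a fixed-size TLB sees much
# more reuse per entry. Sizes and counts are illustrative assumptions.
import random

random.seed(1)
DAG_BYTES = 2 * 1024 * 1024 * 1024            # assumed ~2 GB DAG
reads = [random.randrange(DAG_BYTES) for _ in range(4096)]

def distinct_pages(addrs, page_bytes):
    """Distinct pages touched by a batch of addresses."""
    return len({a // page_bytes for a in addrs})

# Nearly every 4 KB access lands on its own page; at 2 MB granularity
# there are only 1024 pages total, so entries are heavily reused.
print(distinct_pages(reads, 4 * 1024))
print(distinct_pages(reads, 2 * 1024 * 1024))
```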