Thanks for the explanation. Would the large page support be irrelevant for GCN assembler code that loads uncacheable data like the ethash DAG using FLAT_LOAD_DWORD with the SLC and GLC bits set? Bypassing L1 and L2 should mean skipping the access to the tag RAM and the page tables. I haven't actually tried it myself yet, but comments by Wolf0 and zawawa seemed to say it was actually a bit slower.
The large page support should still help if the L1/L2 caches are being bypassed; if anything it would help more. Bypassing the L1/L2 caches skips the access to the caches' tag RAMs but does not skip the access to the page tables, so the reduced TLB thrashing from large page support should still be relevant.
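To put rough numbers on the TLB thrashing point, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration and not a figure for any particular GPU: a 64-entry, fully associative, LRU TLB, a 4 GiB working set standing in for the DAG, uniformly random accesses, and 4 KiB versus 2 MiB as the small and large page sizes. Under those assumptions a warm TLB holds the last N distinct pages touched, so the chance that the next random access hits is about N divided by the number of pages in the working set.

```python
# Rough TLB hit-rate estimate for uniformly random accesses.
# All sizes and the entry count below are hypothetical, chosen only
# to illustrate the effect of page size; they are not hardware specs.

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_hit_rate(working_set_bytes, page_bytes, tlb_entries):
    """Approximate steady-state hit rate of a fully associative LRU TLB
    when accesses are spread uniformly at random over the working set."""
    # Number of pages needed to cover the working set (ceiling division).
    pages = -(-working_set_bytes // page_bytes)
    # A warm TLB holds tlb_entries distinct pages out of `pages` total.
    return min(1.0, tlb_entries / pages)


if __name__ == "__main__":
    working_set = 4 * GIB   # assumed DAG-sized buffer
    entries = 64            # assumed TLB entry count
    for label, page in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB)):
        rate = tlb_hit_rate(working_set, page, entries)
        print(f"{label} pages: estimated TLB hit rate {rate:.6f}")
```

The absolute values depend entirely on the assumed sizes, but the ratio between the two cases is what matters: with the same number of TLB entries, larger pages cover proportionally more of the buffer, regardless of whether the data itself is cached.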