For comparison, high-end GPUs can brute-force PBKDF2 (the wallet.dat password-to-keys stretching) at 4.5k-6k passes per second.
I'm not familiar with wallet file passwords, but it sounds like something that would also include AES.
Your CPU measurement isn't so bad.
To be clear, the speed above is not just PBKDF2. It is PBKDF2 + BIP32 KDF including EC mult + child private to public key + RIPEMD160 & SHA256 hash (for address comparison), among other smaller things.
Speed gains are mainly found by optimizing the SHA512 implementation you are using, which often involves writing it in C with asm directives. The rest of the KDF simply calls a bunch of HMACs, which themselves call SHA512 thousands of times.
That's true, but I have to disagree here. I consider optimization like the above to be the least effective, and only to be used in the "last squeeze" step when we want to get every last bit of performance.
In other words, optimization does not start from the code or language; it starts from the algorithm. For example, in this case it doesn't matter much whether SHA512 is optimized any further if the "HMACs are simply called". Instead, we can modify the algorithm to easily double the speed by avoiding 4,096 block compressions out of 8,192 each round, and that's just part of it.
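To make the 4,096 out of 8,192 figure concrete, here is a rough Python sketch of one way to get there. It assumes the PBKDF2 in question is PBKDF2-HMAC-SHA512 with 2,048 iterations and a 64-byte output (the BIP39 parameters). A naive HMAC hashes the padded key block on every call, so each iteration costs 4 SHA512 compressions (key^ipad, message, key^opad, inner digest). Since the key never changes, the two key blocks can be compressed once and the resulting states reused, leaving 2 compressions per iteration. The function name `pbkdf2_sha512_precomputed` is made up for illustration, and this is not necessarily the exact modification meant above.

```python
import hashlib

BLOCK_SIZE = 128  # SHA-512 block size in bytes


def pbkdf2_sha512_precomputed(password: bytes, salt: bytes,
                              iterations: int = 2048) -> bytes:
    """PBKDF2-HMAC-SHA512, first 64-byte output block only (illustrative)."""
    # Standard HMAC key preparation: hash over-long keys, then zero-pad.
    if len(password) > BLOCK_SIZE:
        password = hashlib.sha512(password).digest()
    key = password.ljust(BLOCK_SIZE, b"\x00")

    # Compress the two key blocks ONCE and keep the hash states around.
    inner = hashlib.sha512(bytes(b ^ 0x36 for b in key))
    outer = hashlib.sha512(bytes(b ^ 0x5C for b in key))

    def prf(message: bytes) -> bytes:
        # Resume from the saved states instead of re-hashing the key blocks.
        i = inner.copy()
        i.update(message)
        o = outer.copy()
        o.update(i.digest())
        return o.digest()

    # U_1 = PRF(salt || INT(1)); T = U_1 xor U_2 xor ... xor U_c
    u = prf(salt + (1).to_bytes(4, "big"))
    t = int.from_bytes(u, "big")
    for _ in range(iterations - 1):
        u = prf(u)
        t ^= int.from_bytes(u, "big")
    return t.to_bytes(64, "big")


if __name__ == "__main__":
    pw, salt = b"example passphrase", b"mnemonic"
    reference = hashlib.pbkdf2_hmac("sha512", pw, salt, 2048, 64)
    print(pbkdf2_sha512_precomputed(pw, salt) == reference)
```

Pure Python will of course be slow; the point is only the structure. Note that `hashlib.pbkdf2_hmac` is used here purely as a correctness reference.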