Please don't go inventing your own cryptography when it can be avoided.
Scrypt did the work to write up an extensive analysis of their effort, and it was reviewed by many people at great cost. And even then the result has not been totally criticism free (
http://eprint.iacr.org/2013/525).
I understand that new crypto has to go through a complicated vetting process to be trusted, and there are good reasons for this. I am just curious as to why something simpler was not settled on in this case. Unnecessary complexity is not our friend.
With a quick glance at the shape of your implementation, it could have trivially failed to be memory hard at all had you appended instead of prepending. I expect that it's actually spending all of its time shuffling around strings, and that a gpu implementation of it would still end up being a zillion times faster...
Lower level code that preinitializes the byte array and moves the pointer down with each iteration could eliminate string-shuffling in software. (I don't know what Python is doing when it prepends -- maybe it's all linked lists or something.) Why would a GPU implementation be faster?
Yes, Scrypt also put high pressure on CPU cache.
Can you elaborate on this? I couldn't find anything in the scrypt paper, or its Wikipedia article regarding "high pressure on the cpu cache". What does that even mean?