I strongly stand by the statement that there is no algorithm that is somehow going to be magically possible on GPUs but impossible on any FPGA hardware configuration. There is simply no resource unique to GPUs to exploit.
Nothing magic is required, just design an algorithm that actually plays to the real strengths of the GPUs.
In particular all currently popular hashing algorithms completely ignore the super fast floating point units in the GPUs and CPUs. Ultimately, some future generations of FPGAs will start including FPU blocks, very much like they started including DSP blocks years ago.
But currently the FPU performance gap is quite wide.
Picking winners in the ASIC game is relatively easy if one isn't afraid of occasionally changing the source code. When properly designed, it doesn't even need to be a hard fork; just the version number of the PoW function needs to be explicitly recorded.
At the moment I don't have time to write a longer discussion, so for now I'll repost what I wrote in another thread. We'll see which of those new threads gets the most intelligent discussion.
a non-trivial portion of the more complex parts of the instruction set
Yeah, that is the key.
As a miner, this frightens me because when this era of "flexible ASICs" arrives, GPU miners will definitely be obsolete. Add to this the threat of Ethereum going PoS, and it seems like the odds are stacked against us regular home-based miners. This might be a signal that now is a good time to liquidate mining rig assets and just directly invest in coins.
The thing is that it is relatively easy to write hash functions that are very ASIC-proof or FPGA-proof.
Bytom folks are a good example. Their goal was not to be general-ASIC-proof but to make sure that the only ASIC that is fast at implementing their hash is their own ASIC. So they wrote a hash function that uses lots of floating point calculations exactly in the way that their AI-oriented ASIC does. The hard part of understanding Bytom's "Tensority" algorithm is finding exact information about the actual ASIC chips that are efficient at doing those calculations.
But the general idea is very simple: if you don't want your XYZ devices to become obsolete, play to their strengths when designing the hash function.
For XYZ==GPU, start with the GPUs' strengths. I haven't studied the recent GPU universal shader architecture, but the main idea was to optimize the particular floating point computation used in 3D graphics with homogeneous coordinates, AX=Y, where A is a 4×4 matrix and X is a 4×1 vector <x,y,z,w> with w==1. So include lots of those in your hash function. In particular, GPUs are especially fast when using FP16, half-precision floating point.
For XYZ==CPU made by Intel/AMD using the x86 architecture, again start with their strengths. They have a unique FPU with a unique 10-byte floating point format and a unique 8-byte BCD decimal integer format. Additionally, they have dedicated hardware to compute various transcendental functions. So use a lot of those in chaotic, irreducible calculations like
https://en.wikipedia.org/wiki/Logistic_map or
https://en.wikipedia.org/wiki/Lorenz_system . Of course one could write an emulation of those formats using quad-precision floating point (pairs of double-precision floats), but writing that emulation will take many months.
During those months you have additional time to research more strengths of your GPUs or CPUs. Use them in a hard fork to ensure that the preferred vendor of your mining hardware continues to be Intel/AMD/Nvidia.