Yes, I am aware this is the claimed goal. This was also the claimed goal for litecoin— and unlike the cryptonote paper litecoin used a publicly known, peer reviewed algorithm. Litecoin failed in this goal because it is simply not really possible to achieve. You should go read the nice writeup I linked to. The cryptonight stuff is _less_ well reviewed and justified than the approach taken in litecoin. It's also insanely slow to verify, which is debilitating to many applications (including just syncing up a node).
Certainly, as you say, you can increase the upfront costs... and that might delay things a bit while the cryptocurrency is worthless. But higher upfront costs will almost guarantee a monopoly to someone who does go and create something. (In fact, CPU-only right now basically means an Intel and AMD duopoly; you realize that the marginal cost of a chip to someone with the right contracts with Intel is a tiny fraction of the retail price, right?) Less competition in the hardware space is a risk, not an advantage. And, of course, litecoin ASICs have higher NRE, and so there is less competition in that space. Making that worse seems unlikely to be an advantage.
Here are some quotes from the document that you've cited:
https://download.wpsoftware.net/bitcoin/asic-faq.pdf
3. Is ASIC resistance possible?
ASIC resistance, in the sense of making life difficult for ASIC manufacturers (and therefore reducing the number of distinct manufacturers) is possible. But it is impossible to create an algorithm which runs at the same speed on general-purpose and dedicated hardware (since general-purpose hardware contains many extraneous features, e.g. communication buses for peripherals), and so ultimately ASIC resistance is futile.
This is true; however, we're talking about the cost of creating an ASIC that would be as multi-purpose as a CPU.
By the way, Ethereum's authors are working in the same direction, as their function will execute random scripts from the block chain. Since those scripts are likely to be arbitrary (the language is Turing-complete), the interpreter should also be multi-purpose.
4. Is memory hardness desirable?
As an aside, since memory is far away and expensive to access on general purpose computers, memory hardness actually increases the benefit provided by ASIC’s!
Our initial idea is that the main advantage of ASICs is pipelining. Since SHA-256 doesn't require more memory than the data processed in the gate (256 bits), it is possible to run different parts of the algorithm on the same circuit. On the other hand, our algo requires 2 MB to process the data. Moreover, it doesn't process everything at once, but random 64-bit parts, so it takes time for them to be transferred. This makes pipelines impossible, as each one would require 2 MB of on-chip memory. 2 MB is indeed a lot for an ASIC, since on-chip memory is more expensive than DRAM (the link you've provided also suggests this). If someone decides to go with DRAM, the additional costs will grow because of the memory controllers. In a nutshell, CryptoNight makes the CPU a suitable and convenient tool.
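To make the access pattern concrete, here is a rough Python sketch of the general idea: a 2 MB scratchpad that is read and written in random 64-bit pieces, where each address depends on the previous result. This is only a toy illustration and not the actual CryptoNight code. The function names, the SHAKE-256 fill, the mixing constant, and the iteration count are all assumptions made for the example.

```python
import hashlib
import struct

SCRATCHPAD_BYTES = 2 * 1024 * 1024            # 2 MB, the figure quoted above
WORD_BYTES = 8                                # 64-bit pieces, as described above
NUM_WORDS = SCRATCHPAD_BYTES // WORD_BYTES    # 262,144 words
MASK64 = (1 << 64) - 1


def fill_scratchpad(seed):
    """Expand the input into 2 MB of pseudo-random 64-bit words.

    SHAKE-256 is an assumption for this sketch, not what CryptoNight uses.
    """
    raw = hashlib.shake_256(seed).digest(SCRATCHPAD_BYTES)
    return list(struct.unpack("<%dQ" % NUM_WORDS, raw))


def toy_memory_hard_hash(data, iterations=10000):
    """Toy hash: every step reads and rewrites one data-dependent word."""
    pad = fill_scratchpad(data)
    state = pad[0]
    for _ in range(iterations):
        idx = state % NUM_WORDS               # next address depends on current state
        word = pad[idx]                       # random 64-bit read from the scratchpad
        # Arbitrary mixing step (hypothetical constant, not from CryptoNight).
        state = ((state ^ word) * 0x9E3779B97F4A7C15 + 1) & MASK64
        pad[idx] = state                      # write back, so the pad keeps changing
    return hashlib.sha256(struct.pack("<Q", state) + data).digest()


if __name__ == "__main__":
    print(toy_memory_hard_hash(b"block header bytes").hex())
```

Because the next address is not known until the current step finishes, each independent instance of the loop needs its own full scratchpad. For scale, 2 MB is 65,536 times the 256-bit figure mentioned for SHA-256.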
I wouldn't like to argue about whether it is possible to create an ideal PoW function. Our main idea was to raise the economic barrier to ASICs significantly.