Regarding Dagger in retrospect, which was Vitalik's originally proposed proof-of-work from before Ethereum was formed:
Thanks to Adam Back and Charles Hoskinson, for discussion and critique of earlier drafts of the protocol
Note Vitalik has apparently deleted that page from his website, but we still have
his Twitter announcement confirming that Dagger was created in 2013.
Charles had asked AnonyMint on Skype about Vitalik's design in late 2013 or early 2014, as Charles was considering joining Vitalik to form what became Ethereum. AnonyMint told him that Dagger could be trivially parallelized and thus was not sequentially memory-hard. Charles has previously posted a confirmation of that discussion on BCT; feel free to dig for it.
This should exemplify that, as of the end of 2013, Adam Back was not as expert as AnonyMint on sequentially memory-hard proof-of-work.
I presume it was shortly thereafter that the following was published, validating AnonyMint's analysis:
…the authors completely forgets another key property a PoW function must provide: it must be sequential-memory hard. This means that not only the function should require large amounts of RAM, but it must not allow easy parallelization. Dagger seems to provide almost the best possible scenario for parallelization.
However, Dagger was proven to be vulnerable to shared memory hardware acceleration
by Sergio Lerner and was then dropped in favor of other avenues of research.
But Sergio did not state the much more relevant factor w.r.t. mining economics: the electrical consumption could have been much more efficient on an ASIC implementation of Dagger, because Dagger is predominantly computation-bound[1] and computation is more efficient with customized logic circuits. That, btw, is also the ASIC-resistance Achilles heel of the Hashimoto proof-of-work (authored by Thaddeus Dryja, one of the creators of the highly flawed Lightning Networks clusterfuck in the making), which was the basis for Ethereum's Dagger-Hashimoto (later renamed Ethash with some modifications). Ethash inherited the computation-bound problem, because the Ethereum developers didn't even realize it is an issue:
...can be mined very quickly by using a large amount of memory. The intent of this is to force ASIC/FPGA designers to be bound by RAM access speeds, while still allowing for verification by thin clients.
IO saturation: The algorithm should consume nearly the entire available memory access bandwidth (this is a strategy toward achieving ASIC resistance, the argument being that commodity RAM, especially in GPUs, is much closer to the theoretical optimum than commodity computing capacity)
The difference between Dagger Hashimoto and Dagger is that, unlike in the original Dagger, the dataset used to query the block is semi-permanent, only being updated at occasional intervals (e.g. once per week). This means that the portion of the effort that goes toward generating the dataset is close to zero, so Sergio Lerner's arguments regarding shared memory speedups become negligible.
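To illustrate the quoted design, here is a schematic Python sketch of a semi-permanent-dataset proof-of-work. The function names, dataset size, read count, and use of SHA-256 are illustrative assumptions only, not the actual Dagger-Hashimoto/Ethash parameters; the point is merely that dataset generation amortizes to near zero while each hash attempt is dominated by random memory reads:

```python
import hashlib

def make_dataset(seed: bytes, n_words: int) -> list:
    # Semi-permanent dataset: regenerated only once per epoch (e.g. weekly),
    # so its generation cost amortizes to near zero across all mining hashes.
    return [int.from_bytes(hashlib.sha256(seed + i.to_bytes(4, "little")).digest()[:8],
                           "little")
            for i in range(n_words)]

def pow_hash(dataset: list, header: bytes, nonce: int, n_reads: int = 64) -> int:
    # Each attempt performs pseudo-random dataset reads, so mining throughput
    # is intended to be bound by memory access rather than dataset generation.
    mix = int.from_bytes(
        hashlib.sha256(header + nonce.to_bytes(8, "little")).digest()[:8], "little")
    for _ in range(n_reads):
        mix ^= dataset[mix % len(dataset)]
        mix = int.from_bytes(
            hashlib.sha256(mix.to_bytes(8, "little")).digest()[:8], "little")
    return mix

dataset = make_dataset(b"epoch-seed", 1 << 12)   # toy size for illustration
h = pow_hash(dataset, b"block-header", nonce=42)
```

Note this sketch says nothing about whether the per-read mixing computation is cheap enough to avoid the computation-bound problem discussed above.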
Expect computation-bound also to be the ASIC-resistance Achilles heel of Monero's (ditto Zcash's) proof-of-work, if Monero's or Zcash's mining economics ever become lucrative enough to motivate ASIC development. The same applies to tromp's Cuckoo Cycle once the insecure use[2] of SipHash is restored to SHA256 as in tromp's earlier version. I also expect that, given Cuckoo Cycle is parallelizable (i.e. not sequentially memory-hard), the memory latency (and thus the allegedly high proportion of electrical consumption in RAM versus computation) can be masked by a massive number of paused hardware threads (aka compute units) waiting to be synced (coalesced) on a memory page (aka row), especially with the smaller memory footprints necessary to make Cuckoo Cycle's proving time reasonable for instant transactions. That would render Cuckoo Cycle's execution time computation-bound up to the memory bandwidth bound, and its power consumption computation-bound, and thus not ASIC resistant.
*
* several forum members can vouch for that, although they'd probably prefer not to.
[1]  It is possible that on the ASIC the computation would have been sufficiently faster to cause the algorithm to become memory bandwidth- or latency-bound on execution time, yet the power consumption would likely remain compute-bound, thus making the ASIC much more electrically efficient.

[2]  Jean-Philippe Aumasson; Daniel J. Bernstein (2012). "SipHash: a fast short-input PRF". https://131002.net/siphash/#dl
"general-purpose cryptographic hash functions perform many extra computations for the goal of collision resistance on public inputs, while MACs have secret keys and do not need collision resistance"
"Note that the standard PRF and MAC security goals allow the attacker access to the output of SipHash on messages chosen adaptively by the attacker. However, they do not allow access to any “leaked” information such as bits of the key or the internal state." 
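To make the quoted contrast concrete, here is a minimal, unoptimized Python sketch of SipHash-2-4 following the structure described in the paper (an illustrative sketch, not production code). Note how few ALU operations one call performs relative to a general-purpose hash such as SHA-256, which is precisely why it is so cheap to implement in custom silicon:

```python
MASK = 0xFFFFFFFFFFFFFFFF  # all arithmetic is mod 2^64

def _rotl(x, b):
    return ((x << b) | (x >> (64 - b))) & MASK

def siphash24(key: bytes, msg: bytes) -> int:
    """SipHash-2-4: 2 compression rounds per 8-byte word, 4 finalization rounds."""
    assert len(key) == 16
    k0 = int.from_bytes(key[:8], "little")
    k1 = int.from_bytes(key[8:], "little")
    v0 = k0 ^ 0x736F6D6570736575
    v1 = k1 ^ 0x646F72616E646F6D
    v2 = k0 ^ 0x6C7967656E657261
    v3 = k1 ^ 0x7465646279746573

    def sipround():
        nonlocal v0, v1, v2, v3
        v0 = (v0 + v1) & MASK; v1 = _rotl(v1, 13); v1 ^= v0; v0 = _rotl(v0, 32)
        v2 = (v2 + v3) & MASK; v3 = _rotl(v3, 16); v3 ^= v2
        v0 = (v0 + v3) & MASK; v3 = _rotl(v3, 21); v3 ^= v0
        v2 = (v2 + v1) & MASK; v1 = _rotl(v1, 17); v1 ^= v2; v2 = _rotl(v2, 32)

    full = len(msg) - (len(msg) % 8)
    for i in range(0, full, 8):
        m = int.from_bytes(msg[i:i + 8], "little")
        v3 ^= m
        sipround(); sipround()
        v0 ^= m
    # Last block: remaining bytes plus the message length in the top byte.
    m = (len(msg) & 0xFF) << 56
    for i, byte in enumerate(msg[full:]):
        m |= byte << (8 * i)
    v3 ^= m
    sipround(); sipround()
    v0 ^= m
    v2 ^= 0xFF
    for _ in range(4):
        sipround()
    return v0 ^ v1 ^ v2 ^ v3
```

The whole keyed hash is a handful of 64-bit adds, rotates, and xors per 8-byte word, versus the dozens of rounds of message expansion and compression in SHA-256.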
Edit:
is memory bandwidth something that an ASIC can easily overcome?
You misunderstood. Try reading it again more carefully and slowly. The memory bandwidth only limits the speedup on an ASIC, not the electrical-efficiency gains. I wrote a very complex statement. Try to grasp what I am saying about Cuckoo Cycle. AnonyMint/TPTB conjectured that with enough hardware threads, it loses the ability to amortize a single memory latency over 1 hash computation; instead up to 1024 hashes (for a 4KB page/row) are coalesced into 1 memory-latency (row-loading) power cycle. Thus it becomes computation-bound and orders-of-magnitude more power efficient on the ASIC. John Tromp had argued that the probability of any one thread being coalesced was 1 in 2^14, due to the ratio of the memory page (row) and bank sizes, but that is for any one h/w thread, and only if the memory footprint of the Cuckoo table is greater than the memory bank size. His concept falls away when 1000s of hardware threads are run and/or smaller memory footprints are chosen. That is why TPTB_need_war urged tromp to buy a Kill-A-Watt meter and measure, but even that won't isolate the metrics needed to estimate the advantage of an ASIC. A more sophisticated bench test and probably some hardware simulation are needed.
Still trying to grasp the second statement ("It loses the ability to amortize a single memory latency over 1 hash computation...")
tromp thinks that Cuckoo Cycle has very low computation relative to memory latency. But this depends on only one 32-bit bucket being read in each memory page/row. With a massive number of h/w threads, one can get them all synced up (or sorted) so that 1024 of the 32-bit buckets are read in one memory-latency cycle; thus we have 1024 times more hash computation amortized over only one memory-latency power cycle (wherein the new row is read into the row buffer on the DRAM memory bank). Thus from a power-consumption standpoint it becomes computation-bound, and thus the ASIC can be orders-of-magnitude more efficient.
² Preferably less than the roughly 50 nanosecond row activation delay for switching rows on a memory bank.
Cuckoo spends only 1/3 of its time doing hash computations, which take little power compared to DRAM switching rows. So the possible gain by ASIC hashing is less than 33%.
I didn't measure that. I estimated that doing 90 64-bit operations (one siphash call) is cheaper than operating the (typically 16k) sense amplifiers to read a new bitline. But I could be wrong about that... I'm assuming a Cuckoo size much larger than 16k of course.
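The two positions can be framed as a back-of-envelope energy model. Every constant below is an illustrative assumption (per-operation and row-activation energies vary widely by process node and DRAM generation), not a measurement; only the 90-operations-per-siphash figure comes from tromp's estimate above:

```python
# Hypothetical energy figures (illustrative assumptions only):
E_OP_CPU  = 70e-12   # joules per 64-bit ALU op on commodity hardware (assumed)
E_OP_ASIC = 1e-12    # joules per 64-bit op on a custom ASIC (assumed)
E_ROW     = 2e-9     # joules per DRAM row activation (assumed)
OPS_PER_SIPHASH = 90 # tromp's estimate quoted above

def energy_per_hash(e_op: float, hashes_per_row: int) -> float:
    """Energy per siphash when `hashes_per_row` bucket reads are coalesced
    into one row activation, amortizing the DRAM cost across them."""
    return OPS_PER_SIPHASH * e_op + E_ROW / hashes_per_row

# tromp's scenario: ~1 hash per row activation, so DRAM dominates the
# energy budget and cheaper ASIC hashing buys relatively little.
cpu_1,  asic_1  = energy_per_hash(E_OP_CPU, 1),  energy_per_hash(E_OP_ASIC, 1)

# TPTB's scenario: 1024 coalesced reads per row, so hashing dominates
# and the ASIC's per-op advantage shows up almost in full.
cpu_1024  = energy_per_hash(E_OP_CPU, 1024)
asic_1024 = energy_per_hash(E_OP_ASIC, 1024)

print(f"uncoalesced ASIC advantage: {cpu_1 / asic_1:.1f}x")
print(f"coalesced   ASIC advantage: {cpu_1024 / asic_1024:.1f}x")
```

With these (assumed) numbers the ASIC advantage grows by an order of magnitude once reads are coalesced, which is the crux of the disagreement; real figures would require the bench testing and hardware simulation mentioned above.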
Edit#2: and here Vlad and the Ethereum team couldn't understand what AnonyMint had considered and quickly dismissed in 2014, because by definition a random circuit has no structure and thus can't be relied upon to have any proof-of-work qualities. Duh!
Approaches that were tried between Dagger and Dagger Hashimoto but are currently not our primary focus include:
 "Random circuit" - a proof of work function developed largely by Vlad Zamfir that involves generating a new program every 1000 nonces - essentially, choosing a new hash function each time, faster than even FPGAs can reconfigure. The approach was temporarily put aside because it was difficult to see what mechanism one can use to generate random programs that would be general enough so that specialization gains would be low; however, we see no fundamental reasons why the concept cannot be made to work.
Another idea AnonyMint did consider was cryptographically obscuring the proof-of-work algorithm, but this would mean that only the developers would be able to audit it, and thus no one would know whether the developers had a secretly superior implementation. It would be analogous to the unverifiable trust required to set up the Zcash master key.
Edit#3:
How extensively is this specific NP problem studied?
The closest seems to be the study of tradeoffs in undirected graph traversal, such as
http://www.cs.toronto.edu/~bor/Papers/timespacetradeoffsundirectedgraphtraversalgraphautomata.pdf
which states:
Although it would be desirable to show a tradeoff for a general model of computation
such as a random access machine, obtaining such a tradeoff is beyond the
reach of current techniques. Thus it is natural to consider a ``structured'' model
(Borodin [16]), that is, one whose basic move is based on the adjacencies of the
graph, as opposed to one whose basic move is based on the bits in the graph's
encoding.
My model is rather odd in that you cannot easily figure out the neighbours of a node,
since edges are generated from nonces, and in the worst case you have to map over
all nonces to find the edges adjacent to a given node.
The paper establishes a quadratic lower bound on the
product of time and space for graph traversal, which I interpret
as good evidence for the edge trimming in my Cuckoo Cycle being optimal,
as it uses just 1 bit per edge.
They had an earlier paper as well:
http://homes.cs.washington.edu/~beame/papers/wag.pdf
I found it helpful to also read the abstracts of these two:
http://www2.tcs.ifi.lmu.de/~schoepp/Docs/formalcr.pdf
http://www.sciencedirect.com/science/article/pii/S0304397500000190
Your Cuckoo Cycle is a log n space representation which discards the local information on the bidirectionality of the edges (i.e. the undirected graph) so as to assume it is a DAG. The undirected quality of the actual graph is hidden from your representation, except by global re-enumeration of the edges (via the "micro" nonces). For readers: tromp never explicitly defines this terminology, but one can deduce he uses "macro nonce" to mean the nonce for the block, and "micro nonces" to mean the enumeration of the edges in the graph.
You have not shown that Cuckoo Cycle can guarantee that a representation which stores the graph in linear space won't admit more efficient solutions, perhaps only on an ASIC and/or GPU, within the log(mn) time-space product lower bound tradeoff. For example, it may be more efficient to decrease the traversal workspace and apply a probabilistic or nondeterministic JAG, because that may make it more efficient to get a smaller number of threads to coalesce on the page (row) latency, while any increased hash computations are relatively insignificant on the ASIC. Or the undirected graph may yield more cycles, more efficiently.
The edge trimming from David Andersen appears to throw away information at the representation (storage of the edges) stage. The paper you cited refers to compression of the traversal workspace, i.e. the storage of the state machine for visiting the graph: the "pebbles" and "Q state" in the JAG model.
Also, the traversal workspace in a Cuckoo Cycle proof-of-work in the undirected case amortizes workspace over the numerous cycles found, which is a different algorithm from the single-shot s-t reachability traversal considered by the research papers cited above.