Do you mean morphing the "bits" field in the header? Is it legal to use a lower-than-required "bits" in a Bitcoin header?
No; the commitment included in the block is a hashtree with leaves that look like:
{index for how many back, previous_block_hash, compact-encoded work difference between now and then}
...
For multiple backpointers. A backpointer is only valid (considered acceptable for traversal in a proof) if the block committing to it meets the work-difference threshold. As with merged mining, it's possible to perform N hashcashes in parallel, each with a different threshold. Just as, if Namecoin's difficulty were higher than Bitcoin's, you could merge-mine Namecoin and get a Bitcoin block with a Namecoin commitment in it which is still useless for Namecoin, even though it's a valid Bitcoin block.
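Concretely, a leaf might be encoded like this minimal Python sketch (the field names and types are illustrative only, not a wire format):

```python
from dataclasses import dataclass

@dataclass
class BacklinkLeaf:
    """One leaf of the backpointer hash tree committed in a block."""
    index_back: int         # how many blocks back this pointer reaches
    prev_block_hash: bytes  # hash of the block it points to
    work_delta: int         # compact-encoded work between this block and that one
```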
So to follow a link that goes one block back, you need work at the normal bits level-- this is the normal case.
To follow a link that jumps two back, your block must be valid at the 2*bits level... and so on: to successfully mine a 10000-back hop you need a block that is valid against a work threshold equal to whatever work was spanned by those blocks (e.g. 10000x greater if difficulty has been constant).
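As a sketch of that traversal rule (hypothetical names, constant difficulty assumed, hashes and targets as plain integers where a lower hash means more work):

```python
def backlink_usable(block_hash: int, bits_target: int, span: int) -> bool:
    """May this block's link jumping `span` blocks back be followed?

    A span-`span` link is traversable only if the committing block is
    valid at the span*bits level: its hash must beat a target `span`
    times stricter than the one it was actually mined against. With
    varying difficulty, `span` becomes the ratio of the spanned work
    to one block's work.
    """
    return block_hash <= bits_target // span
```

So a block whose hash happens to land 10000x below its target can be used to follow a 10000-back link, even though mining it only required meeting the 1x target.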
This is why the proof can hide data, but you can't fake one with less work (on average) than just mining normally. Individually any given block is unlikely to jump far back, but in the aggregate long jumps happen, and as a result the proof is log-sized in the length of the chain. (Actually less when the hashrate is constantly increasing.)
For a given set of commitments there are multiple possible paths, and so multiple proofs of the same chain are possible-- if you could have gone 100 back you could also go 50 back. Sometimes going less far back makes for a smaller proof; perhaps at 51 back there is a single hop that goes all the way. The smallest proof can be constructed in linear time via dynamic programming: start with the block you're trying to reach with your proof (e.g. genesis) and scan forward. For each block, keep track of the size of the best proof from it back to the destination, and the first hop of that proof. When you consider a block, check each of its valid backlinks, add the cost of taking that backlink to the best cost of the block it lands on, and take the minimum as this block's best. When you reach the end, walk backwards to read off the path. This algorithm always produces the smallest possible proof, though a simple greedy algorithm (always jump as far as you can, but not past your destination) usually isn't _that_ much larger.
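A minimal Python sketch of that dynamic program (my scaffolding, not code from the thread): `backlinks[i]` is assumed to already contain only the heights block `i` may legally jump to, i.e. links passing the work-difference check, and cost is counted in hops; substituting encoded byte sizes is a one-line change.

```python
def smallest_proof(backlinks, dest=0):
    """Cheapest proof from the chain tip back to block `dest`.

    backlinks[i] -- heights block i may legally jump to (valid links only).
    Returns the path of heights from the tip down to dest.
    """
    n = len(backlinks)
    INF = float("inf")
    best = [INF] * n        # best[i]: cost of the best proof from i to dest
    first_hop = [None] * n  # the backlink achieving best[i]
    best[dest] = 0
    for i in range(dest + 1, n):    # scan forward from the destination
        for j in backlinks[i]:
            if j < dest:            # never jump past the destination
                continue
            cost = best[j] + 1      # +1 hop; use a byte size here instead if preferred
            if cost < best[i]:
                best[i], first_hop[i] = cost, j
    if best[n - 1] == INF:
        raise ValueError("no valid backlink path from the tip to dest")
    path, cur = [n - 1], n - 1      # walk backwards to read off the proof
    while cur != dest:
        cur = first_hop[cur]
        path.append(cur)
    return path
```

The greedy variant needs no table at all: from each block, simply follow the farthest valid link that doesn't overshoot the destination.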
The chance of success is higher because of the higher variance, right? The normal way follows a gamma (Erlang) distribution while the compact proof follows an exponential distribution?
Right, exactly-- well, to be more precise, the compact proof and the normal chain are both gamma; the compact proof takes the log of the shape parameter while increasing lambda to compensate. The same expected work is represented by fewer, rarer events, so the expectation is the same but the distribution shape is different.
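A quick numerical illustration (my framing and numbers, not from the thread): measure work in units of one expected block find. A normal N-block stretch is then gamma with shape N and scale 1; a compact-proof-style race of k ≈ log N events, each N/k times rarer, is gamma with shape k and scale N/k. Both have mean N, but the CDFs differ sharply:

```python
import numpy as np
from scipy.stats import gamma

N = 10_000                 # chain length in blocks
k = int(np.log(N))         # ~log N rare events in the compact proof
mean_work = N              # expected work, identical in both cases

normal  = gamma(a=N, scale=1.0)            # N ordinary block finds
compact = gamma(a=k, scale=mean_work / k)  # k events, each N/k times rarer

# Same expectation, very different shape: the compact race has far more
# mass at small values (lucky early success) and a fatter right tail.
for frac in (0.25, 0.5, 1.0):
    t = frac * mean_work
    print(f"P(done by {frac} of expected work): "
          f"normal={normal.cdf(t):.3g}  compact={compact.cdf(t):.3g}")
```

The compact distribution puts far more mass at small values-- a lucky prover can finish well ahead of the expected work-- but since the expectation is unchanged, there's no cheaper-than-honest way to fake the proof on average.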
Technically it's possible to make the proofs "over-prove", e.g. to go X work back you have to show a*X expected total work; but I think I concluded that you can only get constant-factor improvements that way unless you break the asymptotic log scaling of the proofs. One nice thing is that such techniques are between the prover and verifier: the commitments don't change.