Well, I've been talking to a few people about this but got no real response from anyone as to whether it's possible ...
(Woke up with this idea back on the 4th of August ...)
So I guess I need to post in a thread where someone is working on a CL kernel and just let them implement it, if they don't already do it.

I've written it in pseudo-code because I still don't follow how the CL file actually does 2^n checks and returns the full list of valid results.
Yeah, I've programmed in almost every language known to man (except C#, and that's avoided by choice), but I still don't quite get the interface from C/C++ to the CL and how that matches what happens.
What I am discussing is the 2nd call to SHA256, the one that takes the output of the first call as its input (not the first call itself).
Anyway, to explain, here's the end of the SHA256 pseudo code from the wikipedia:
==================
  for i from 0 to 63
    s0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
    maj := (a and b) xor (a and c) xor (b and c)
    t2 := s0 + maj
    s1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
    ch := (e and f) xor ((not e) and g)
    t1 := h + s1 + ch + k[ i] + w[ i]
    h := g
    g := f
    f := e
    e := d + t1
    d := c
    c := b
    b := a
    a := t1 + t2
  Add this chunk's hash to result:
  h0 := h0 + a
  h1 := h1 + b
  h2 := h2 + c
  h3 := h3 + d
  h4 := h4 + e
  h5 := h5 + f
  h6 := h6 + g
  h7 := h7 + h
Then test if h0..h7 is a share (CHECK0, CHECK1, ?)
==================
Firstly, I added that last line of course.
I understand that with current difficulty, if h0 != 0 then we don't have a share (call this CHECK0).
If h0 = 0 then we check some leading part of h1 based on the current difficulty (call this CHECK1).
... anyone who knows better, feel free to correct this.
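
To pin down what I mean by the two checks, here's a minimal Python sketch. The names check0, check1 and h1_zero_bits are made up for illustration, and it assumes h0..h7 are 32-bit words with "some leading part of h1" expressed as a number of leading zero bits. A real miner may test different words or compare against a full target, so treat this as a sketch of the idea only.

```python
MASK32 = 0xFFFFFFFF

def check0(h0):
    """CHECK0: True if h0 still allows a share (h0 must be all zero)."""
    return (h0 & MASK32) == 0

def check1(h1, h1_zero_bits):
    """CHECK1: True if the top h1_zero_bits bits of h1 are zero.

    h1_zero_bits is a hypothetical stand-in for 'some leading part of h1
    based on the current difficulty' (0..32).
    """
    if h1_zero_bits <= 0:
        return True
    return ((h1 & MASK32) >> (32 - min(h1_zero_bits, 32))) == 0

def is_share(h0, h1, h1_zero_bits):
    """Both checks together, in the order the final hash would be tested."""
    return check0(h0) and check1(h1, h1_zero_bits)
```
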

If the difficulty ever gets to the point of checking h2, then my optimisation can be made even better by going back one more step (unrolling i := 61 as well) in the pseudo code shown below.
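
The reason going back a step works can be written out. The notation a_i, meaning the value of `a` after round i, is introduced here and is not part of the Wikipedia pseudo code. Because every round does b := a, c := b, d := c, the values left in b, c and d after round 63 are just older values of a, so (all additions mod 2^32):

```latex
\begin{aligned}
h_0^{\mathrm{final}} &= h_0 + a_{63} \\
h_1^{\mathrm{final}} &= h_1 + a_{62} \\
h_2^{\mathrm{final}} &= h_2 + a_{61} \\
h_3^{\mathrm{final}} &= h_3 + a_{60}
\end{aligned}
```

So the final h1 is known once round 62 has produced its t1 + t2, and the final h2 once round 61 has.
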
A reasonably simple optimisation of the end code for when we are about to check if h0..h7 is a share (i.e. only the 2nd hash)
==================
 for i from 0 to 61
    s0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
    maj := (a and b) xor (a and c) xor (b and c)
    t2 := s0 + maj
    s1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
    ch := (e and f) xor ((not e) and g)
    t1 := h + s1 + ch + k[ i] + w[ i]
    h := g
    g := f
    f := e
    e := d + t1
    d := c
    c := b
    b := a
    a := t1 + t2
 i := 62
    s0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
    maj := (a and b) xor (a and c) xor (b and c)
    t2 := s0 + maj
    s1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
    ch := (e and f) xor ((not e) and g)
    t1 := h + s1 + ch + k[ i] + w[ i]
 tmpa := t1 + t2
 tmpb := h1 + tmpa (this is the actual value of h1 at the end)
 if tmpb fails CHECK1 then abort - not a share
  (i.e. return false for a share)
    h := g
    g := f
    f := e
    e := d + t1
    d := c
    c := b
    b := a
    a := tmpa
 i := 63
    s0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
    maj := (a and b) xor (a and c) xor (b and c)
    t2 := s0 + maj
    s1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
    ch := (e and f) xor ((not e) and g)
    t1 := h + s1 + ch + k[ i] + w[ i]
 tmpa := h0 + t1 + t2 (this is the actual value of h0 at the end)
 if tmpa fails CHECK0 then abort - not a share
  (i.e. return false for a share)
    h := g
    g := f
    f := e
    e := d + t1
    d := c
    c := b
 Add this chunk's hash to result:
 h0 := tmpa
 h1 := tmpb
 h2 := h2 + c
 h3 := h3 + d
 h4 := h4 + e
 h5 := h5 + f
 h6 := h6 + g
 h7 := h7 + h
It's a share - unless we need to test h2?
==================
Firstly the obvious (as I've said twice above):
This should only be done when calculating a hash to be tested as a share.
Since the actual process is a double-hash, the first hash should not, of course, do this.
In i=62:
If the tmpb test (CHECK1) says it isn't a share, it avoids the entire last round (i=63), the 'e' calculation at i=62 and any unneeded assignments after that.
We also don't care about the actual values of h0-h7 in that case, so there is no need to assign them anything (or do the additions) except whatever is needed to flag that the result is not a share (e.g. set h0=-1 if h0..h7 must be examined later - or just return false if that is good enough - I don't know which the code actually needs).
CHECK1's probability of failure is high, so it easily covers the cost of the extra calculation (h1 + tmpa) needed to do it.
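
To put a rough number on "high": assuming the bits of h1 behave like uniformly random bits and CHECK1 demands some number of leading zero bits (h1_zero_bits is my own illustrative parameter, not something taken from a real miner), the fraction of hashes that abort at i=62 is 1 - 2^-bits. A quick sketch:

```python
def fraction_skipping_round_63(h1_zero_bits):
    """Share of hashes that fail CHECK1, assuming h1 is uniformly random."""
    return 1.0 - 2.0 ** -h1_zero_bits

# Print the abort fraction for a few hypothetical CHECK1 widths.
for bits in (1, 8, 16):
    print(bits, fraction_skipping_round_63(bits))
```

Each of those hashes pays for one extra 32-bit addition and saves a full round plus the 'e' addition at i=62.
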
In i=63:
If the tmpa test (CHECK0) says it isn't a share, it avoids the 'e' calculation at i=63 and any unneeded assignments after that.
As at i=62, we don't care about the actual values of h0-h7 in that case, so nothing needs assigning (or adding) beyond whatever flags that the result is not a share.
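
Here is a runnable Python sketch of the whole idea, for anyone who wants to check the logic before touching a kernel. It is not CL code and not how any existing miner does it. All function names are mine, the two checks are expressed as "number of leading zero bits" parameters (an assumption), and returning None stands in for "return false for a share". Setting both parameters to 0 switches the checks off so the result can be compared against hashlib.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF

def _primes(n):
    """First n primes by trial division (only 64 are needed)."""
    found, c = [], 2
    while len(found) < n:
        if all(c % p for p in found):
            found.append(c)
        c += 1
    return found

def _icbrt(n):
    """Exact integer cube root by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# Standard SHA-256 constants, derived rather than typed in: fractional
# bits of the cube roots / square roots of the first primes.
_P = _primes(64)
K = [_icbrt(p << 96) & MASK for p in _P]
H_INIT = [math.isqrt(p << 64) & MASK for p in _P[:8]]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def schedule(block):
    """Expand a 64-byte block into the 64-word message schedule w[]."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        x, y = w[i - 15], w[i - 2]
        s0 = rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)
        s1 = rotr(y, 17) ^ rotr(y, 19) ^ (y >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    return w

def t1_t2(s, i, w):
    """t1 and t2 for round i, as in the pseudo code."""
    a, b, c, d, e, f, g, h = s
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & g)
    return (h + s1 + ch + K[i] + w[i]) & MASK, (s0 + maj) & MASK

def shift(s, t1, new_a):
    """The h := g ... b := a shuffle, with e := d + t1."""
    a, b, c, d, e, f, g, h = s
    return [new_a, a, b, c, (d + t1) & MASK, e, f, g]

def fails(word, zero_bits):
    """True if the top zero_bits bits of a 32-bit word are not all zero."""
    return (word >> (32 - zero_bits)) != 0

def second_hash_checked(hs, block, h0_zero_bits=32, h1_zero_bits=0):
    """Final h0..h7 for one block, or None if a check aborted early."""
    w, s = schedule(block), list(hs)
    for i in range(62):                      # rounds 0..61 unchanged
        t1, t2 = t1_t2(s, i, w)
        s = shift(s, t1, (t1 + t2) & MASK)
    t1, t2 = t1_t2(s, 62, w)                 # i = 62
    tmpa = (t1 + t2) & MASK
    tmpb = (hs[1] + tmpa) & MASK             # final value of h1
    if fails(tmpb, h1_zero_bits):
        return None                          # CHECK1: round 63 never runs
    s = shift(s, t1, tmpa)
    t1, t2 = t1_t2(s, 63, w)                 # i = 63
    tmpa = (hs[0] + t1 + t2) & MASK          # final value of h0
    if fails(tmpa, h0_zero_bits):
        return None                          # CHECK0
    tail = shift(s, t1, 0)[2:]               # final c..h; a and b not needed
    return [tmpa, tmpb] + [(x + y) & MASK for x, y in zip(hs[2:], tail)]

def pad_digest(d):
    """Pad a 32-byte first-hash digest into the single block of hash two."""
    return d + b"\x80" + b"\x00" * 23 + struct.pack(">Q", 256)
```

Usage would be something like second_hash_checked(H_INIT, pad_digest(first_digest), 32, bits), where first_digest is the 32-byte output of the first SHA256 and bits is however much of h1 the difficulty needs. The first hash keeps using the plain, unmodified rounds.
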
P.S. any and all mistakes I've made - oh well but the concept is there anyway
Any mistakes? Comments?