If you can't get multiple independent 256-bit operations in parallel, the next alternative is to use deferred carries. Currently your 256-bit number is represented as four 64-bit 'digits'. If you instead represent it as five 52-bit digits, each stored in its own 64-bit word, the 12 spare bits per word let you perform about 2^12 successive additions without overflowing and then process the carries afterwards.
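To make the suggestion concrete, here is a minimal scalar sketch of how I understand it. The type `u256_52` and the helpers `from_u64x4`, `add_lazy`, `normalize` and `to_u64x4` are names I made up for illustration, not anything from a library, and the digit layout (least significant first, one digit per 64-bit word) is my assumption.

```c
#include <assert.h>
#include <stdint.h>

#define LIMB_BITS 52
#define LIMB_MASK ((UINT64_C(1) << LIMB_BITS) - 1)

/* Hypothetical type: a 256-bit value as five 52-bit digits,
   least significant first, each held in its own 64-bit word. */
typedef struct { uint64_t d[5]; } u256_52;

/* Split four 64-bit words (least significant first) into 52-bit digits. */
static u256_52 from_u64x4(const uint64_t w[4]) {
    u256_52 r;
    r.d[0] = w[0] & LIMB_MASK;
    r.d[1] = ((w[0] >> 52) | (w[1] << 12)) & LIMB_MASK;
    r.d[2] = ((w[1] >> 40) | (w[2] << 24)) & LIMB_MASK;
    r.d[3] = ((w[2] >> 28) | (w[3] << 36)) & LIMB_MASK;
    r.d[4] = w[3] >> 16;
    return r;
}

/* Digit-wise add with no carry propagation. The five additions are
   independent of each other; the spare top 12 bits of each word absorb
   the carries, so up to 2^12 normalized values can be summed safely. */
static void add_lazy(u256_52 *acc, const u256_52 *x) {
    for (int i = 0; i < 5; i++) acc->d[i] += x->d[i];
}

/* Propagate the deferred carries so every digit is below 2^52 again.
   Returns whatever overflows past the top digit (bits 260 and up). */
static uint64_t normalize(u256_52 *a) {
    uint64_t carry = 0;
    for (int i = 0; i < 5; i++) {
        uint64_t t = a->d[i] + carry;
        a->d[i] = t & LIMB_MASK;
        carry = t >> LIMB_BITS;
    }
    return carry;
}

/* Pack normalized digits back into four 64-bit words (result mod 2^256). */
static void to_u64x4(const u256_52 *a, uint64_t w[4]) {
    for (int i = 0; i < 5; i++) assert(a->d[i] <= LIMB_MASK);
    w[0] = a->d[0] | (a->d[1] << 52);
    w[1] = (a->d[1] >> 12) | (a->d[2] << 40);
    w[2] = (a->d[2] >> 24) | (a->d[3] << 28);
    w[3] = (a->d[3] >> 36) | (a->d[4] << 16);
}
```

The idea would be to call `add_lazy` many times in a row and `normalize` only once at the end, so the sequential carry chain is paid once per batch rather than once per addition.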
I'm curious though, why 52 bits in particular? For addition/subtraction, the safe zone where digits don't clobber carry bits and you can distinguish overflow from a real result is the lower 63 bits, and for multiplication it's the lower 32 bits.
I'm still not sure how to process the carry bits simultaneously, though. Presumably I could load the first two of these "words" (for example) at once using something like _mm_set_epi64x(bits52[1], bits52[0]) and add another pair built the same way using _mm_add_epi64. The problem is how to move the carry bits over to the next words without a bunch of performance-killing _mm_extract_* instructions. Extracting the carry bits for each word one by one instead of in parallel is slow, and possibly slower than what I'm using now:
// c = a + b (needs <immintrin.h>; the carry-in of the first step must be 0)
unsigned char carry = 0;
carry = _addcarry_u64(carry, a.bits64[0], b.bits64[0], c.bits64 + 0);
carry = _addcarry_u64(carry, a.bits64[1], b.bits64[1], c.bits64 + 1);
carry = _addcarry_u64(carry, a.bits64[2], b.bits64[2], c.bits64 + 2);
carry = _addcarry_u64(carry, a.bits64[3], b.bits64[3], c.bits64 + 3);
Not counting loads and stores, this uses 4 add instructions (typically one ADD followed by three ADC). Ideally I would like a way to "add" 2 or 4 of these words in one instruction, a second instruction to shift the carry bits one 64-bit (or 32-bit) word to the left, and a third instruction to add the carry bits to the subsequent words.
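For reference, here is a rough SSE2 sketch of that add / shift-carries / add-carries shape, assuming the 52-bit-digit representation rather than full 64-bit words. The type `u256_52x`, its lane layout, and the function names are hypothetical, made up for this example. Two things it does not do: a single carry pass only partially normalizes (a digit sitting at 2^52 - 1 that receives a carry ends up at exactly 2^52 and needs another pass), and the carry out of the top digit is simply discarded. It also says nothing about whether this is faster than the ADC chain.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 */

/* Hypothetical layout: five 52-bit digits padded to six 64-bit lanes,
   v[0] = (d0, d1), v[1] = (d2, d3), v[2] = (d4, unused zero lane). */
typedef struct { __m128i v[3]; } u256_52x;

static u256_52x load_digits(const uint64_t d[5]) {
    u256_52x r;
    r.v[0] = _mm_set_epi64x((long long)d[1], (long long)d[0]);
    r.v[1] = _mm_set_epi64x((long long)d[3], (long long)d[2]);
    r.v[2] = _mm_set_epi64x(0, (long long)d[4]);
    return r;
}

static void store_digits(u256_52x a, uint64_t d[5]) {
    uint64_t tmp[6];
    for (int i = 0; i < 3; i++)
        _mm_storeu_si128((__m128i *)(tmp + 2 * i), a.v[i]);
    assert(tmp[5] == 0); /* the padding lane should stay zero */
    for (int i = 0; i < 5; i++) d[i] = tmp[i];
}

/* Step 1: add two digits per instruction, no carry handling at all. */
static u256_52x add_lazy_sse2(u256_52x a, u256_52x b) {
    u256_52x r;
    for (int i = 0; i < 3; i++) r.v[i] = _mm_add_epi64(a.v[i], b.v[i]);
    return r;
}

/* Steps 2 and 3: pull the carries out of every lane at once, move each
   one lane up with byte shifts, and add it to the masked digits. */
static u256_52x carry_pass_sse2(u256_52x a) {
    const __m128i mask = _mm_set1_epi64x((long long)((UINT64_C(1) << 52) - 1));
    __m128i c0 = _mm_srli_epi64(a.v[0], 52); /* (c0, c1) */
    __m128i c1 = _mm_srli_epi64(a.v[1], 52); /* (c2, c3) */
    /* The carry out of d4 is not computed: it is dropped in this sketch. */
    __m128i up0 = _mm_slli_si128(c0, 8);                 /* (0,  c0) */
    __m128i up1 = _mm_or_si128(_mm_srli_si128(c0, 8),
                               _mm_slli_si128(c1, 8));   /* (c1, c2) */
    __m128i up2 = _mm_srli_si128(c1, 8);                 /* (c3, 0)  */
    u256_52x r;
    r.v[0] = _mm_add_epi64(_mm_and_si128(a.v[0], mask), up0);
    r.v[1] = _mm_add_epi64(_mm_and_si128(a.v[1], mask), up1);
    r.v[2] = _mm_add_epi64(_mm_and_si128(a.v[2], mask), up2);
    return r;
}
```

No `_mm_extract_*` is needed because the carries never leave the vector registers; the price is that the pass may have to be repeated (or finished with a scalar ripple) before the digits are fully normalized.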