Bitcoin Forum
May 11, 2024, 02:57:09 PM *
News: Latest Bitcoin Core release: 27.0 [Torrent]
 
   Home   Help Search Login Register More  
Pages: [1]
  Print  
Author Topic: Ways to accelerate 256-bit arithmetic with 64-bit ops?  (Read 224 times)
NotATether (OP)
Legendary
*
Offline Offline

Activity: 1596
Merit: 6735


bitcoincleanup.com / bitmixlist.org


View Profile WWW
March 31, 2021, 12:07:59 AM
 #1

I am working on some code (Jean Luc PONS kangaroo program to be specific) that currently stores 256-bit ints in blocks of 64-bit integers and sequentially adds/subtracts/multiplies each block using carry bits.

I wanted to vectorize these operations but the carry bits prevent me from doing so. Hardware acceleration (SSE/AVX in particular) will only operate on blocks of 64-bit ints.

Does anyone know any algorithms floating around for basic arithmetic like addition et al that don't need the carry bit, and can be scaled up to pairs of 4 blocks of 64-bit ints?

256-bit numbers are used to represent public and private keys.

.
.BLACKJACK ♠ FUN.
█████████
██████████████
████████████
█████████████████
████████████████▄▄
░█████████████▀░▀▀
██████████████████
░██████████████
████████████████
░██████████████
████████████
███████████████░██
██████████
CRYPTO CASINO &
SPORTS BETTING
▄▄███████▄▄
▄███████████████▄
███████████████████
█████████████████████
███████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
███████████████████████
█████████████████████
███████████████████
▀███████████████▀
█████████
.
1715439429
Hero Member
*
Offline Offline

Posts: 1715439429

View Profile Personal Message (Offline)

Ignore
1715439429
Reply with quote  #2

1715439429
Report to moderator
1715439429
Hero Member
*
Offline Offline

Posts: 1715439429

View Profile Personal Message (Offline)

Ignore
1715439429
Reply with quote  #2

1715439429
Report to moderator
I HATE TABLES I HATE TABLES I HA(╯°□°)╯︵ ┻━┻ TABLES I HATE TABLES I HATE TABLES
Advertised sites are not endorsed by the Bitcoin Forum. They may be unsafe, untrustworthy, or illegal in your jurisdiction.
gmaxwell
Moderator
Legendary
*
expert
Offline Offline

Activity: 4172
Merit: 8419



View Profile WWW
March 31, 2021, 01:06:33 AM
 #2

I am working on some code (Jean Luc PONS kangaroo program to be specific) that currently stores 256-bit ints in blocks of 64-bit integers and sequentially adds/subtracts/multiplies each block using carry bits.

I wanted to vectorize these operations but the carry bits prevent me from doing so. Hardware acceleration (SSE/AVX in particular) will only operate on blocks of 64-bit ints.

Does anyone know any algorithms floating around for basic arithmetic like addition et al that don't need the carry bit, and can be scaled up to pairs of 4 blocks of 64-bit ints?

256-bit numbers are used to represent public and private keys.

The much easier way to parallelize it is to arrange for things to run 4 independent 256 bit operations in parallel. Tongue  (unfortunately, SSE2 etc actually only give you 32 bit multipliers which is a huge performance punch in the gut)

If you can't get multiple independent 256 bit operations in parallel the next alternative is to use deferred carries.   Currently your 256 bit number is represented as 4  64-bit 'digits'.  If you instead represent it as 5 52-bit digits then you can perform 12 successive additions without overflowing and then process the carries afterwards.

The 64-bit field code in libsecp256k1 works this way.
NotATether (OP)
Legendary
*
Offline Offline

Activity: 1596
Merit: 6735


bitcoincleanup.com / bitmixlist.org


View Profile WWW
March 31, 2021, 09:55:07 AM
 #3

If you can't get multiple independent 256 bit operations in parallel the next alternative is to use deferred carries.   Currently your 256 bit number is represented as 4  64-bit 'digits'.  If you instead represent it as 5 52-bit digits then you can perform 12 successive additions without overflowing and then process the carries afterwards.

I'm curious though, why 52 bits in particular? For addition/subtraction the safe zone where digits don't clobber carry bits and you can distinguish overflow from a real result is the lower 63 bits, and for multiplication it's the lower 32-bits.

I'm still not sure how to process the carry bits simultaneously though. Presumably I could set the first two [for example] of these "words" at once using something like _mm_set_epi64(bits52[1], bits52[0]), add another set of numbers I made using _mm_add_epi64, but the problem here is how am I going to add the carry bits over to the words without doing a bunch of performance-killing _mm_extract_* instructions, which is slow, because now I have to extract the carry bits for each number one by one instead of in parallel, and is possibly slower than what I'm using now:

Code:
 // c = a + b
int carry = _add_carry_u64(carry, a.bits64[0], b.bits64[0], c.bits64 + 0)
carry = _add_carry_u64(carry, a.bits64[1], b.bits64[1], c.bits64 + 1)
carry = _add_carry_u64(carry, a.bits64[2], b.bits64[2], c.bits64 + 2)
carry = _add_carry_u64(carry, a.bits64[3], b.bits64[3], c.bits64 + 3)

Not counting loads & stores, this uses 4 ADD instructions and ideally I would like a way to "add" 2 or 4 of these words in one instruction, a second instruction to shift the carry bits one 64-bit (or 32-bit) word to the left and then a third instruction that adds the carry bits to the subsequent words.

.
.BLACKJACK ♠ FUN.
█████████
██████████████
████████████
█████████████████
████████████████▄▄
░█████████████▀░▀▀
██████████████████
░██████████████
████████████████
░██████████████
████████████
███████████████░██
██████████
CRYPTO CASINO &
SPORTS BETTING
▄▄███████▄▄
▄███████████████▄
███████████████████
█████████████████████
███████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
███████████████████████
█████████████████████
███████████████████
▀███████████████▀
█████████
.
gmaxwell
Moderator
Legendary
*
expert
Offline Offline

Activity: 4172
Merit: 8419



View Profile WWW
March 31, 2021, 10:30:26 PM
Merited by NotATether (2), ABCbits (1), A-Bolt (1)
 #4

I'm curious though, why 52 bits in particular? For addition/subtraction the safe zone where digits don't clobber carry bits and you can distinguish overflow from a real result is the lower 63 bits, and for multiplication it's the lower 32-bits.

256/5 = 52. The idea is that you add an extra word, and now you have extra space. You can now you can make 12 additions of these 52 bits in a row without processing carries.

The carry processing would still be mostly sequential, but you delay doing it.  So you can add totally in parallel a bunch of times and then after that propagate the carries all at once.

E.g. look at the FE add function in libsecp256k1:


https://github.com/bitcoin-core/secp256k1/blob/master/src/field_5x52_impl.h#L405

Code:
SECP256K1_INLINE static void secp256k1_fe_add(secp256k1_fe *r, const secp256k1_fe *a) {
    r->n[0] += a->n[0];
    r->n[1] += a->n[1];
    r->n[2] += a->n[2];
    r->n[3] += a->n[3];
    r->n[4] += a->n[4];
}

(with the verification code removed)

-- it's embarrassingly parallel.
NotATether (OP)
Legendary
*
Offline Offline

Activity: 1596
Merit: 6735


bitcoincleanup.com / bitmixlist.org


View Profile WWW
April 01, 2021, 06:29:12 AM
 #5

Looks like it works for subtraction too.

I suppose that to chain a bunch of multiplications at once I could use 8 blocks with the lower 31 bits used, and the remainder of bits I need to have 256 math go in the highest block since I don't have to worry about overflow, and that'll let me do 1 multiplication without a carry.

For 2 multiplications I'd use the lower 30 bits, for 4 I'd use 29 and so on. This'll allow me to mix many additions/subtractions and multiplications at once and go without carrying for a while.

.
.BLACKJACK ♠ FUN.
█████████
██████████████
████████████
█████████████████
████████████████▄▄
░█████████████▀░▀▀
██████████████████
░██████████████
████████████████
░██████████████
████████████
███████████████░██
██████████
CRYPTO CASINO &
SPORTS BETTING
▄▄███████▄▄
▄███████████████▄
███████████████████
█████████████████████
███████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
███████████████████████
█████████████████████
███████████████████
▀███████████████▀
█████████
.
bigvito19
Full Member
***
Offline Offline

Activity: 706
Merit: 111


View Profile
April 06, 2021, 01:01:24 PM
 #6

Looks like it works for subtraction too.

I suppose that to chain a bunch of multiplications at once I could use 8 blocks with the lower 31 bits used, and the remainder of bits I need to have 256 math go in the highest block since I don't have to worry about overflow, and that'll let me do 1 multiplication without a carry.

For 2 multiplications I'd use the lower 30 bits, for 4 I'd use 29 and so on. This'll allow me to mix many additions/subtractions and multiplications at once and go without carrying for a while.

How is this coming along?
NotATether (OP)
Legendary
*
Offline Offline

Activity: 1596
Merit: 6735


bitcoincleanup.com / bitmixlist.org


View Profile WWW
April 06, 2021, 03:49:11 PM
 #7

How is this coming along?

I didn't get a chance to try this yet.

.
.BLACKJACK ♠ FUN.
█████████
██████████████
████████████
█████████████████
████████████████▄▄
░█████████████▀░▀▀
██████████████████
░██████████████
████████████████
░██████████████
████████████
███████████████░██
██████████
CRYPTO CASINO &
SPORTS BETTING
▄▄███████▄▄
▄███████████████▄
███████████████████
█████████████████████
███████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
███████████████████████
█████████████████████
███████████████████
▀███████████████▀
█████████
.
Pages: [1]
  Print  
 
Jump to:  

Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!