Ways to accelerate 256-bit arithmetic with 64-bit ops?

Bitcoin Forum

May 11, 2024, 02:57:09 PM

Welcome, Guest. Please login or register.

News: Latest Bitcoin Core release: 27.0 [Torrent]

Home

Help

Search

Login

Register

More

Bitcoin Forum > Bitcoin > Development & Technical Discussion > Ways to accelerate 256-bit arithmetic with 64-bit ops?

Pages: [1]

« previous topic next topic »

Author

Topic: Ways to accelerate 256-bit arithmetic with 64-bit ops? (Read 224 times)

NotATether (OP)

Legendary

Offline

Offline

Activity: 1596
Merit: 6735

bitcoincleanup.com / bitmixlist.org

WWW

Ways to accelerate 256-bit arithmetic with 64-bit ops?

March 31, 2021, 12:07:59 AM

#1

I am working on some code (Jean Luc PONS kangaroo program to be specific) that currently stores 256-bit ints in blocks of 64-bit integers and sequentially adds/subtracts/multiplies each block using carry bits.

I wanted to vectorize these operations but the carry bits prevent me from doing so. Hardware acceleration (SSE/AVX in particular) will only operate on blocks of 64-bit ints.

Does anyone know any algorithms floating around for basic arithmetic like addition et al that don't need the carry bit, and can be scaled up to pairs of 4 blocks of 64-bit ints?

256-bit numbers are used to represent public and private keys.

.
.^{BLACKJACK^♠FUN.}

███▄██████
██████████████▀
████████████
█████████████████
████████████████▄▄
░█████████████▀░▀▀
██████████████████
░██████████████
█████████████████▄
░██████████████▀
████████████
███████████████░██
██████████

CRYPTO CASINO &
SPORTS BETTING

│

.
REWARD SYSTEM
.ELITE SEAT..

│

▄▄███████▄▄
▄███████████████▄
███████████████████
█████████████████████
███████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
███████████████████████
█████████████████████
███████████████████
▀███████████████▀
███████████████████

.

Whitepaper
JACK

1715439429

Hero Member

Offline

Offline

Posts: 1715439429

Personal Message (Offline)

Ignore

1715439429

1715439429

Reply with quote

#2

1715439429


	Report to moderator

1715439429

Hero Member

Offline

Offline

Posts: 1715439429

Personal Message (Offline)

Ignore

1715439429

1715439429

Reply with quote

#2

1715439429


	Report to moderator

I HATE TABLES I HATE TABLES I HA(╯°□°）╯︵ ┻━┻ TABLES I HATE TABLES I HATE TABLES

Advertised sites are not endorsed by the Bitcoin Forum. They may be unsafe, untrustworthy, or illegal in your jurisdiction.

gmaxwell

Moderator
Legendary

expert

Offline

Offline

Activity: 4172
Merit: 8419

WWW

Re: Ways to accelerate 256-bit arithmetic with 64-bit ops?

March 31, 2021, 01:06:33 AM

#2

Quote from: NotATether on March 31, 2021, 12:07:59 AM

I am working on some code (Jean Luc PONS kangaroo program to be specific) that currently stores 256-bit ints in blocks of 64-bit integers and sequentially adds/subtracts/multiplies each block using carry bits.

I wanted to vectorize these operations but the carry bits prevent me from doing so. Hardware acceleration (SSE/AVX in particular) will only operate on blocks of 64-bit ints.

Does anyone know any algorithms floating around for basic arithmetic like addition et al that don't need the carry bit, and can be scaled up to pairs of 4 blocks of 64-bit ints?

256-bit numbers are used to represent public and private keys.

The much easier way to parallelize it is to arrange for things to run 4 independent 256 bit operations in parallel. Tongue

Tongue

(unfortunately, SSE2 etc actually only give you 32 bit multipliers which is a huge performance punch in the gut)

If you can't get multiple independent 256 bit operations in parallel the next alternative is to use deferred carries. Currently your 256 bit number is represented as 4 64-bit 'digits'. If you instead represent it as 5 52-bit digits then you can perform 12 successive additions without overflowing and then process the carries afterwards.

The 64-bit field code in libsecp256k1 works this way.

NotATether (OP)

Legendary

Offline

Offline

Activity: 1596
Merit: 6735

bitcoincleanup.com / bitmixlist.org

WWW

Re: Ways to accelerate 256-bit arithmetic with 64-bit ops?

March 31, 2021, 09:55:07 AM

#3

Quote from: gmaxwell on March 31, 2021, 01:06:33 AM

If you can't get multiple independent 256 bit operations in parallel the next alternative is to use deferred carries. Currently your 256 bit number is represented as 4 64-bit 'digits'. If you instead represent it as 5 52-bit digits then you can perform 12 successive additions without overflowing and then process the carries afterwards.

I'm curious though, why 52 bits in particular? For addition/subtraction the safe zone where digits don't clobber carry bits and you can distinguish overflow from a real result is the lower 63 bits, and for multiplication it's the lower 32-bits.

I'm still not sure how to process the carry bits simultaneously though. Presumably I could set the first two [for example] of these "words" at once using something like _mm_set_epi64(bits52[1], bits52[0]), add another set of numbers I made using _mm_add_epi64, but the problem here is how am I going to add the carry bits over to the words without doing a bunch of performance-killing _mm_extract_* instructions, which is slow, because now I have to extract the carry bits for each number one by one instead of in parallel, and is possibly slower than what I'm using now:

Code:

 // c = a + b
int carry = _add_carry_u64(carry, a.bits64[0], b.bits64[0], c.bits64 + 0)
carry = _add_carry_u64(carry, a.bits64[1], b.bits64[1], c.bits64 + 1)
carry = _add_carry_u64(carry, a.bits64[2], b.bits64[2], c.bits64 + 2)
carry = _add_carry_u64(carry, a.bits64[3], b.bits64[3], c.bits64 + 3)

Not counting loads & stores, this uses 4 ADD instructions and ideally I would like a way to "add" 2 or 4 of these words in one instruction, a second instruction to shift the carry bits one 64-bit (or 32-bit) word to the left and then a third instruction that adds the carry bits to the subsequent words.

.
.^{BLACKJACK^♠FUN.}

███▄██████
██████████████▀
████████████
█████████████████
████████████████▄▄
░█████████████▀░▀▀
██████████████████
░██████████████
█████████████████▄
░██████████████▀
████████████
███████████████░██
██████████

CRYPTO CASINO &
SPORTS BETTING

│

.
REWARD SYSTEM
.ELITE SEAT..

│

▄▄███████▄▄
▄███████████████▄
███████████████████
█████████████████████
███████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
███████████████████████
█████████████████████
███████████████████
▀███████████████▀
███████████████████

.

Whitepaper
JACK

gmaxwell

Moderator
Legendary

expert

Offline

Offline

Activity: 4172
Merit: 8419

WWW

Re: Ways to accelerate 256-bit arithmetic with 64-bit ops?

March 31, 2021, 10:30:26 PM

Merited by NotATether (2), ABCbits (1), A-Bolt (1)

#4

Quote from: NotATether on March 31, 2021, 09:55:07 AM

I'm curious though, why 52 bits in particular? For addition/subtraction the safe zone where digits don't clobber carry bits and you can distinguish overflow from a real result is the lower 63 bits, and for multiplication it's the lower 32-bits.

256/5 = 52. The idea is that you add an extra word, and now you have extra space. You can now you can make 12 additions of these 52 bits in a row without processing carries.

The carry processing would still be mostly sequential, but you delay doing it. So you can add totally in parallel a bunch of times and then after that propagate the carries all at once.

E.g. look at the FE add function in libsecp256k1:

https://github.com/bitcoin-core/secp256k1/blob/master/src/field_5x52_impl.h#L405

Code:

SECP256K1_INLINE static void secp256k1_fe_add(secp256k1_fe *r, const secp256k1_fe *a) {
    r->n[0] += a->n[0];
    r->n[1] += a->n[1];
    r->n[2] += a->n[2];
    r->n[3] += a->n[3];
    r->n[4] += a->n[4];
}

(with the verification code removed)

-- it's embarrassingly parallel.

NotATether (OP)

Legendary

Offline

Offline

Activity: 1596
Merit: 6735

bitcoincleanup.com / bitmixlist.org

WWW

Re: Ways to accelerate 256-bit arithmetic with 64-bit ops?

April 01, 2021, 06:29:12 AM

#5

Looks like it works for subtraction too.

I suppose that to chain a bunch of multiplications at once I could use 8 blocks with the lower 31 bits used, and the remainder of bits I need to have 256 math go in the highest block since I don't have to worry about overflow, and that'll let me do 1 multiplication without a carry.

For 2 multiplications I'd use the lower 30 bits, for 4 I'd use 29 and so on. This'll allow me to mix many additions/subtractions and multiplications at once and go without carrying for a while.

.
.^{BLACKJACK^♠FUN.}

███▄██████
██████████████▀
████████████
█████████████████
████████████████▄▄
░█████████████▀░▀▀
██████████████████
░██████████████
█████████████████▄
░██████████████▀
████████████
███████████████░██
██████████

CRYPTO CASINO &
SPORTS BETTING

│

.
REWARD SYSTEM
.ELITE SEAT..

│

▄▄███████▄▄
▄███████████████▄
███████████████████
█████████████████████
███████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
███████████████████████
█████████████████████
███████████████████
▀███████████████▀
███████████████████

.

Whitepaper
JACK

bigvito19

Full Member

Offline

Offline

Activity: 706
Merit: 111

Re: Ways to accelerate 256-bit arithmetic with 64-bit ops?

April 06, 2021, 01:01:24 PM

#6

Quote from: NotATether on April 01, 2021, 06:29:12 AM

Looks like it works for subtraction too.

I suppose that to chain a bunch of multiplications at once I could use 8 blocks with the lower 31 bits used, and the remainder of bits I need to have 256 math go in the highest block since I don't have to worry about overflow, and that'll let me do 1 multiplication without a carry.

For 2 multiplications I'd use the lower 30 bits, for 4 I'd use 29 and so on. This'll allow me to mix many additions/subtractions and multiplications at once and go without carrying for a while.

How is this coming along?

NotATether (OP)

Legendary

Offline

Offline

Activity: 1596
Merit: 6735

bitcoincleanup.com / bitmixlist.org

WWW

Re: Ways to accelerate 256-bit arithmetic with 64-bit ops?

April 06, 2021, 03:49:11 PM

#7

Quote from: bigvito19 on April 06, 2021, 01:01:24 PM

How is this coming along?

I didn't get a chance to try this yet.

.
.^{BLACKJACK^♠FUN.}

███▄██████
██████████████▀
████████████
█████████████████
████████████████▄▄
░█████████████▀░▀▀
██████████████████
░██████████████
█████████████████▄
░██████████████▀
████████████
███████████████░██
██████████

CRYPTO CASINO &
SPORTS BETTING

│

.
REWARD SYSTEM
.ELITE SEAT..

│

▄▄███████▄▄
▄███████████████▄
███████████████████
█████████████████████
███████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
███████████████████████
█████████████████████
███████████████████
▀███████████████▀
███████████████████

.

Whitepaper
JACK

Pages: [1]

Bitcoin Forum > Bitcoin > Development & Technical Discussion > Ways to accelerate 256-bit arithmetic with 64-bit ops?

« previous topic next topic »

Jump to:

Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines