Ways to accelerate 256-bit arithmetic with 64-bit ops?

Bitcoin Forum

October 02, 2024, 01:35:06 AM

Welcome, Guest. Please login or register.

News: Latest Bitcoin Core release: 27.1 [Torrent]

Home

Help

Search

Login

Register

More

Bitcoin Forum > Bitcoin > Development & Technical Discussion > Ways to accelerate 256-bit arithmetic with 64-bit ops?

Pages: [1]

« previous topic next topic »

Author

Topic: Ways to accelerate 256-bit arithmetic with 64-bit ops? (Read 242 times)

NotATether (OP)

Legendary

Offline

Offline

Activity: 1736
Merit: 7298

In memory of o_e_l_e_o

WWW

Ways to accelerate 256-bit arithmetic with 64-bit ops?

March 31, 2021, 12:07:59 AM

#1

I am working on some code (Jean Luc PONS kangaroo program to be specific) that currently stores 256-bit ints in blocks of 64-bit integers and sequentially adds/subtracts/multiplies each block using carry bits.

I wanted to vectorize these operations but the carry bits prevent me from doing so. Hardware acceleration (SSE/AVX in particular) will only operate on blocks of 64-bit ints.

Does anyone know any algorithms floating around for basic arithmetic like addition et al that don't need the carry bit, and can be scaled up to pairs of 4 blocks of 64-bit ints?

256-bit numbers are used to represent public and private keys.

▄▄███████████████████▄▄
▄██▀▀███▄█████▀▀█████▄██▄
██▀██████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
██████████████████████▄██
▀██▀█████▄▄█████▀███▄▄██▀
▀▀███████████████████▀▀

^GODEX.IO ^│ ^{Anonymous Crypto Exchanger}

██
██
██
██
██
██
██
██
██
██
██
██
██

SIGN−UP FREE

LIMITLESS

TRUSTWORTHY

██
██
██
██
██
██
██
██
██
██
██
██
██

BTC XMR ETH TON 800+ coins

██
██
██
██
██
██
██
██
██
██
██
██
██

gmaxwell

Moderator
Legendary

expert

Offline

Offline

Activity: 4242
Merit: 8702

WWW

Re: Ways to accelerate 256-bit arithmetic with 64-bit ops?

March 31, 2021, 01:06:33 AM

#2

Quote from: NotATether on March 31, 2021, 12:07:59 AM

I am working on some code (Jean Luc PONS kangaroo program to be specific) that currently stores 256-bit ints in blocks of 64-bit integers and sequentially adds/subtracts/multiplies each block using carry bits.

I wanted to vectorize these operations but the carry bits prevent me from doing so. Hardware acceleration (SSE/AVX in particular) will only operate on blocks of 64-bit ints.

Does anyone know any algorithms floating around for basic arithmetic like addition et al that don't need the carry bit, and can be scaled up to pairs of 4 blocks of 64-bit ints?

256-bit numbers are used to represent public and private keys.

The much easier way to parallelize it is to arrange for things to run 4 independent 256 bit operations in parallel. Tongue

Tongue

(unfortunately, SSE2 etc actually only give you 32 bit multipliers which is a huge performance punch in the gut)

If you can't get multiple independent 256 bit operations in parallel the next alternative is to use deferred carries. Currently your 256 bit number is represented as 4 64-bit 'digits'. If you instead represent it as 5 52-bit digits then you can perform 12 successive additions without overflowing and then process the carries afterwards.

The 64-bit field code in libsecp256k1 works this way.

NotATether (OP)

Legendary

Offline

Offline

Activity: 1736
Merit: 7298

In memory of o_e_l_e_o

WWW

Re: Ways to accelerate 256-bit arithmetic with 64-bit ops?

March 31, 2021, 09:55:07 AM

#3

Quote from: gmaxwell on March 31, 2021, 01:06:33 AM

If you can't get multiple independent 256 bit operations in parallel the next alternative is to use deferred carries. Currently your 256 bit number is represented as 4 64-bit 'digits'. If you instead represent it as 5 52-bit digits then you can perform 12 successive additions without overflowing and then process the carries afterwards.

I'm curious though, why 52 bits in particular? For addition/subtraction the safe zone where digits don't clobber carry bits and you can distinguish overflow from a real result is the lower 63 bits, and for multiplication it's the lower 32-bits.

I'm still not sure how to process the carry bits simultaneously though. Presumably I could set the first two [for example] of these "words" at once using something like _mm_set_epi64(bits52[1], bits52[0]), add another set of numbers I made using _mm_add_epi64, but the problem here is how am I going to add the carry bits over to the words without doing a bunch of performance-killing _mm_extract_* instructions, which is slow, because now I have to extract the carry bits for each number one by one instead of in parallel, and is possibly slower than what I'm using now:

Code:

 // c = a + b
int carry = _add_carry_u64(carry, a.bits64[0], b.bits64[0], c.bits64 + 0)
carry = _add_carry_u64(carry, a.bits64[1], b.bits64[1], c.bits64 + 1)
carry = _add_carry_u64(carry, a.bits64[2], b.bits64[2], c.bits64 + 2)
carry = _add_carry_u64(carry, a.bits64[3], b.bits64[3], c.bits64 + 3)

Not counting loads & stores, this uses 4 ADD instructions and ideally I would like a way to "add" 2 or 4 of these words in one instruction, a second instruction to shift the carry bits one 64-bit (or 32-bit) word to the left and then a third instruction that adds the carry bits to the subsequent words.

▄▄███████████████████▄▄
▄██▀▀███▄█████▀▀█████▄██▄
██▀██████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
██████████████████████▄██
▀██▀█████▄▄█████▀███▄▄██▀
▀▀███████████████████▀▀

^GODEX.IO ^│ ^{Anonymous Crypto Exchanger}

██
██
██
██
██
██
██
██
██
██
██
██
██

SIGN−UP FREE

LIMITLESS

TRUSTWORTHY

██
██
██
██
██
██
██
██
██
██
██
██
██

BTC XMR ETH TON 800+ coins

██
██
██
██
██
██
██
██
██
██
██
██
██

gmaxwell

Moderator
Legendary

expert

Offline

Offline

Activity: 4242
Merit: 8702

WWW

Re: Ways to accelerate 256-bit arithmetic with 64-bit ops?

March 31, 2021, 10:30:26 PM

Merited by NotATether (2), ABCbits (1), A-Bolt (1)

#4

Quote from: NotATether on March 31, 2021, 09:55:07 AM

I'm curious though, why 52 bits in particular? For addition/subtraction the safe zone where digits don't clobber carry bits and you can distinguish overflow from a real result is the lower 63 bits, and for multiplication it's the lower 32-bits.

256/5 = 52. The idea is that you add an extra word, and now you have extra space. You can now you can make 12 additions of these 52 bits in a row without processing carries.

The carry processing would still be mostly sequential, but you delay doing it. So you can add totally in parallel a bunch of times and then after that propagate the carries all at once.

E.g. look at the FE add function in libsecp256k1:

https://github.com/bitcoin-core/secp256k1/blob/master/src/field_5x52_impl.h#L405

Code:

SECP256K1_INLINE static void secp256k1_fe_add(secp256k1_fe *r, const secp256k1_fe *a) {
    r->n[0] += a->n[0];
    r->n[1] += a->n[1];
    r->n[2] += a->n[2];
    r->n[3] += a->n[3];
    r->n[4] += a->n[4];
}

(with the verification code removed)

-- it's embarrassingly parallel.

NotATether (OP)

Legendary

Offline

Offline

Activity: 1736
Merit: 7298

In memory of o_e_l_e_o

WWW

Re: Ways to accelerate 256-bit arithmetic with 64-bit ops?

April 01, 2021, 06:29:12 AM

#5

Looks like it works for subtraction too.

I suppose that to chain a bunch of multiplications at once I could use 8 blocks with the lower 31 bits used, and the remainder of bits I need to have 256 math go in the highest block since I don't have to worry about overflow, and that'll let me do 1 multiplication without a carry.

For 2 multiplications I'd use the lower 30 bits, for 4 I'd use 29 and so on. This'll allow me to mix many additions/subtractions and multiplications at once and go without carrying for a while.

▄▄███████████████████▄▄
▄██▀▀███▄█████▀▀█████▄██▄
██▀██████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
██████████████████████▄██
▀██▀█████▄▄█████▀███▄▄██▀
▀▀███████████████████▀▀

^GODEX.IO ^│ ^{Anonymous Crypto Exchanger}

██
██
██
██
██
██
██
██
██
██
██
██
██

SIGN−UP FREE

LIMITLESS

TRUSTWORTHY

██
██
██
██
██
██
██
██
██
██
██
██
██

BTC XMR ETH TON 800+ coins

██
██
██
██
██
██
██
██
██
██
██
██
██

bigvito19

Full Member

Offline

Offline

Activity: 711
Merit: 111

Re: Ways to accelerate 256-bit arithmetic with 64-bit ops?

April 06, 2021, 01:01:24 PM

#6

Quote from: NotATether on April 01, 2021, 06:29:12 AM

Looks like it works for subtraction too.

I suppose that to chain a bunch of multiplications at once I could use 8 blocks with the lower 31 bits used, and the remainder of bits I need to have 256 math go in the highest block since I don't have to worry about overflow, and that'll let me do 1 multiplication without a carry.

For 2 multiplications I'd use the lower 30 bits, for 4 I'd use 29 and so on. This'll allow me to mix many additions/subtractions and multiplications at once and go without carrying for a while.

How is this coming along?

NotATether (OP)

Legendary

Offline

Offline

Activity: 1736
Merit: 7298

In memory of o_e_l_e_o

WWW

Re: Ways to accelerate 256-bit arithmetic with 64-bit ops?

April 06, 2021, 03:49:11 PM

#7

Quote from: bigvito19 on April 06, 2021, 01:01:24 PM

How is this coming along?

I didn't get a chance to try this yet.

▄▄███████████████████▄▄
▄██▀▀███▄█████▀▀█████▄██▄
██▀██████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
██████████████████████▄██
▀██▀█████▄▄█████▀███▄▄██▀
▀▀███████████████████▀▀

^GODEX.IO ^│ ^{Anonymous Crypto Exchanger}

██
██
██
██
██
██
██
██
██
██
██
██
██

SIGN−UP FREE

LIMITLESS

TRUSTWORTHY

██
██
██
██
██
██
██
██
██
██
██
██
██

BTC XMR ETH TON 800+ coins

██
██
██
██
██
██
██
██
██
██
██
██
██

Pages: [1]

Bitcoin Forum > Bitcoin > Development & Technical Discussion > Ways to accelerate 256-bit arithmetic with 64-bit ops?

« previous topic next topic »

Jump to:

Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines