Tenebrix scaling questions

BubbleBoy (OP)

Sr. Member

Offline

Activity: 504
Merit: 250

Tenebrix scaling questions

September 27, 2011, 02:16:25 PM
Last edit: September 27, 2011, 04:52:22 PM by BubbleBoy

Without reading the source, I have to ask how can Tenebrix scale. I've actually thought about memory bound hash-cash myself for a Bitcoin-like system, and I dropped scrypt as unworkable.

In the field of memory hard pricing functions with which scrypt is related, there is the notion of a difference parameter or work ratio: the ratio between evaluating F() and F()^-1, how hard it is to do the proof of work versus the cost of evaluating the proof. For example for hash cash the ratio is exceptionally good: it takes a single hash to check for a huge amount of work, up to 2^128 work for a perfect 256 bit hash. Published memory bound functions achieve a quadratic factor for K in the 10-11 range so it takes one operation to evaluate 2^10 operations worth of work (ABADI et all). This feature is desirable because supposing there's a mail server that checks for spam, you want to force the spammers to do allot of work, and make the mail server do as little work as possible on checking proof of work stamps. Proof-of-work tokens should be easy to check, and hard to mint.

Back to scrypt, this memory hard function is not designed as an e-cash primitive, but as a password derivation function: you set your password, the server does 1 second worth of work, and saves the hashed form. Each time you login the server does the same work for 1 second and checks the resulting hash against the database. Password derivation is designed to be irreversible, you can only compute F() at a high cost, and should have no chance at F()^-1 in a proper system. That is you have no way to extract the password from it's stored hash.

How does this affects Tenebrix ? Well, in order to use scrypt as a pricing function you can do it in two basic ways:
- you make both the miners and those verifying their work compute a difficult variant of scrypt; each time a block is solved and advertised every node on the network must redo the work of the miner (in order to check it before broadcasting it as valid); this clearly does not scale, every node must have a processing power equal to the whole mining network
- you make miners look for an iterated variant of hash-cash based on easy scrypt; it's easy to check a valid mined block however the easy scrypt primitive is no longer memory hard, and can probably be accelerated on a GPU. (I assume this is the solution chosen, since we are talking about hash rates in the K/sec/core)

So my first impression is that Tenebrix fails either at scaling or at being GPU-hostile, but maybe the designers can avert this impression by explaining how the security parameters were chosen so that the network can scale without failing pray to GPU optimized mining.

MIDEX
▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ GET TOKENS ▂▂▂▂

▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂
BLOCKCHAIN BASED FINANCIAL PLATFORM # WEB ● ANN + Bounty <
with Licensed Exchange approved by Swiss Bankers and Lawyers > Telegram ● Facebook ● Twitter ● Blog #

ArtForz

Sr. Member

Offline

Activity: 406
Merit: 257

Re: Tenebrix scaling questions

September 28, 2011, 02:59:44 PM

Option #2, giving up most of the memory-hardness to get a primitive useful for a iterated hashing PoW.
Parameters used are N=1024,p=1,r=1 -> 128KiB of V.
Not exactly much, but enough to give current gen GPUs a major headache.
Basically, instead of aiming for "enough memory use to hit external memory on any device", aiming for "small enough to fit in L2 cache on CPUs, big enough to require too much on-chip memory/cache for running enough parallel instances to make GPU/FPGA/VLSI worth it."
If a future generation of GPUs has several MB of fast on-chip cache (which would make em competitive), a forced update with increased memory usage and parallelism is on the wall (e.g. N=4096, p=4 or 8 ).
And it'd still be "fast enough" to not cause too much of an issue with speed of verifying PoW, with N=4096,p=8 still only ~20ms per hash on a current decent CPU core.
Notice that N=4096 p=8 r=1 is one of the parameter sets used in the scrypt paper for relative cost on specialized hardware.
using the same cost/area figures for VLSI as the paper, rough relative estimates:
md5=1
double-sha256=6
scrypt(1024,1,1)=1000000
scrypt(4096,8,1)=40000000

Or let's put it another way, best relative performance achieved real-world for sha256(sha256()), scrypt(1024,1,1), scrypt(4096,2,1) and scrypt(4096,8,1) for a 3GHz K10 core vs. a HD6970
double-sha256 1:120
scrypt(1024,1,1) 1:5.2
scrypt(4096,2,1) 1:3.7
scrypt(4096,8,1) 1:4.4
That's without using vector instructions on the CPU, using a time-memory tradeoff for scrypt on the HD6970 that only stores 1/4 or 1/8 of V values and recreates the missing ones on the fly, straightforward GPU version was slower by a factor of 3-5.

bitcoin: 1Fb77Xq5ePFER8GtKRn2KDbDTVpJKfKmpz
i0coin: jNdvyvd6v6gV3kVJLD7HsB5ZwHyHwAkfdw

kano

Legendary

Offline

Activity: 4592
Merit: 1851

Linux since 1997 RedHat 4

Re: Tenebrix scaling questions

September 28, 2011, 03:35:01 PM

To repeat what I said in the other thread ...

N=1024,p=1,r=1 -> 128KiB - Seriously? Your kidding right?

The Bitcoin GPU hash code runs about usually 128 parallel hashes and ... is there a GPU card that couldn't handle that?
Apparently you know a bit about OpenCL? (Lolcust seemed to imply that)
Well, either you do and you're hoping to get a lot of these coins, or you don't and missed that 128KiB is really small (or I completely missed something Tongue

)
GPU's can access their own internal memory VERY fast ...

And reading your last table saying:

double-sha256 1:120
and
scrypt(1024,1,1) 1:5.2

What does that mean?
An HD6970 can of course process a double hash around 100 times a CPU
Thus it can do scrypt(1024,1,1) at 5.2 times a CPU?

Pool: https://kano.is - low 0.5% fee PPLNS 3 Days - Most reliable Solo with ONLY 0.5% fee Bitcointalk thread: Forum
Discord support invite at https://kano.is/ Majority developer of the ckpool code - k for kano
The ONLY active original developer of cgminer. Original master git: https://github.com/kanoi/cgminer

Lolcust

Member

Offline

Activity: 112
Merit: 11

Hillariously voracious

Re: Tenebrix scaling questions

September 28, 2011, 11:35:09 PM

Kano, do note the following details

Quote

That's without using vector instructions on the CPU, using a time-memory tradeoff for scrypt on the HD6970 that only stores 1/4 or 1/8 of V values and recreates the missing ones on the fly, straightforward GPU version was slower by a factor of 3-5.

As to whether 128KiB is tiny or not depends on how much on-chip cache you have, and the biggest GPU on chip cache is IIRC 64 kb.

Should future GPUs have phatter cache, as explained above, there is still room for improvement enough to overwhelm the likely cache size of those hypothetical monster-cards.

P.S.: How many cores does a 139 USD phenom have, 6 ?

Per that table, optimized implementation on HD6970 labors to nominally outrun unoptimized implementation on a single that core, and HD6970 costs 349 usd (let's not even mention power consumption)

Geist Geld, the experimental cryptocurrency, is ready for yet another SolidCoin collapse Wink

Feed the Lolcust!
NMC: N6YQFkH9Gn9CTm4mpGwuLB5zLzqWTWFw67
BTC: 15F8xbgRBA1XZ4hmtdFDUasroa2A5rYg8M
GEG: gK5Lx6ypWgr69Gw9yGzE6dsA7kcuCRZRK

kano

Legendary

Offline

Activity: 4592
Merit: 1851

Linux since 1997 RedHat 4

Re: Tenebrix scaling questions

September 28, 2011, 11:49:23 PM

But my comment about that table is that either the first line is wrong or misread.

A HD6970 will do 100x a typical CPU and certainly at LEAST 10x any CPU the same price as it.

That table says double-sha256 1:120
So unless it is totally crap, that can only mean CPU:GPU

Then the next line says scrypt(1024,1,1) 1:5.2
Which would mean GPU is 5.2 times faster.

Now, I have no idea where he pulled those number out of ... but reading that table says GPU will be 5.2 time CPU on N=1024,p=1,r=1
I'm not saying the table is based on real info, just that it doesn't match what's written and just switching the numbers would simply mean they are not reliable at all.

ArtForz

Sr. Member

Offline

Activity: 406
Merit: 257

Re: Tenebrix scaling questions

September 29, 2011, 01:02:19 AM

Quote from: kano on September 28, 2011, 11:49:23 PM

No, thats *one 3GHz K10 CPU core* vs. a HD6970
A 3GHz K10 core is roughly 3Mh/s doing double-sha256 aka bitcoinhash.
A stock HD6970 is about 360-400Mh/s for bitcoinhash.
Hey look, 3:360 == 1:120, a miracle!

So yes, the factor of 1:5.2 is for ONE CPU CORE vs. a 6970.
For the math or comprehension impaired, thats "A hex(=6)core PhenomII at 3GHz is a few % faster than a HD6970"
And this is from actual measurements of real code running on real devices, not some "well, in theory it *should* be possible" numbers that overlook massive problems on real hardware.

So... where is any written performance info for GPUs doing scrypt variants? The original paper only deals with VLSI, any citations?

So I went for the "well, to get some numbers I'll actually have to try to write a optimized implementation" approach.

So... maybe your GPU is 100 times faster than a CPU executing pure ALU code, but to do that you actually need to... like... store the per-thread state somewhere?
So, where to put 128KiB (or 32 KiB using a factor 4 time/space tradeoff) of memory for V *per thread*?

Registers? Way too small.
LDS - works, but you only got 64KiB/CU so you end up with a max of 2 threads/CU even with a factor 4 time/space tradeoff
GDS? way too small again.
L2 read cache? You got 512k of that, so with a /4 reduction you can keep 16 threads worth of V arrays there, but that's not really helping a whole lot due to reasons I'll explain shortly.
External GDDR5? Completely horrible for randomly reading 128 byte items all over the place.

Now, ATI GPUs are funny beasts, they use something best described as "hyperthreading" similar to a niagara CPU to hide their 4-clock instruction latency. Which means you need to be executing 64 threads in parallel to make full use of a CU (16 VLIW4 cores * 4 deep pipeline).

Best option so far is "use LDS, live with only being able to run 2 threads/CU (due to LDS size)... welll... thanks to what I explained above, that means you're effectively getting ALU throughput of half of one VLIW4 core per CU per clock.

So a whole 6970 ends up effectively performing as if you only had 12 fully utilized VLIW4 cores.

Well, turns using this approach in the real world, at 880MHz core clock and with overhead it actually manages roughly 5.2 times the performance of a single 3GHz K10 core.

So, if you have any constructive suggestions on what other approaches to try to wrangle better performance out of a AMD GPU for this variant of scrypt, I'm listening.

Inb4 obvious, first thing I tried was the KISS "just allocate a huge buffer for V arrays in global memory, run 64-256 threads and have the mem and L2 controllers figure it out." approach. also with time/space tradeoff factors of 2 to 16. Ends up slower than going with LDS.

bitcoin: 1Fb77Xq5ePFER8GtKRn2KDbDTVpJKfKmpz
i0coin: jNdvyvd6v6gV3kVJLD7HsB5ZwHyHwAkfdw

BubbleBoy (OP)

Sr. Member

Offline

Activity: 504
Merit: 250

Re: Tenebrix scaling questions

October 02, 2011, 02:57:03 PM
Last edit: October 02, 2011, 05:35:00 PM by BubbleBoy

I stand corrected, I think it's fair to say the mining algorithm as used in the *brix currencies is memory hard, and GPU/FPGA/ASIC mining has a much worse efficiency ratio over CPU, maybe to the point of being uneconomical.

It's easy to verify the parameters chosen are indeed N=1024,p=1,r=1 by reading the source, and I can confirm that this choice of parameters requires log2(N)*p*r*128 = 128KB of memory. It seems to me this choice still fulfills the security proof of scrypt, i.e there's no shorthand way to compute it without either allocating the required memory, or multiplexing a Salsa20/8 core (thus loosing throughput by the same ratio that you are saving memory).

A Radeon 5870 has 5MB of internal register memory for the shader cores, so even leaving aside all practicalities mentioned by ArtForz, the best you can hope to achieve is 40 mining threads, each significantly slower than a CPU thread due to slower clock and very simple core architecture. I can't find any other place where the algorithm state could fit, certainly the external DDR is not an option due to latency and random access, and the other buffers are tiny. So I think the assessment that a modern multicore CPU is on par with a graphic board when mining *brix is correct.

FPGA ? Usually they have small amounts of RAM and synthesizing RAM is very expensive. A latest generation Stratix V device has around 2600 2.5KB blocks, so it could run 50 mining cores while costing 1000$+. Maybe in large quantities they could be more cost effective than a CPU, depending on the clockrate they can achieve.

Quote

That table says double-sha256 1:120
So unless it is totally crap, that can only mean CPU:GPU

Then the next line says scrypt(1024,1,1) 1:5.2
Which would mean GPU is 5.2 times faster.

It's not that graphic boards are slow at computing scrypt. It's that CPUs are totally crap at the massively parallel hashing required by Bitcoin. CPUs are optimized for desktop kind of tasks, moving large blocks of data to and from the memory and doing some minimal processing to it: text search, signal analysis etc. While mining Bitcoin only a tinny part of the CPU is used; it's vast caches and wide data paths are not touched and the pressure is on the 10-20 ALUs, of which a Radeon 5870 has 1600. So it's easy to understand why a hash specifically designed to exploit the CPU's abilities is so much harder to accelerate on a GPU.

One cannot but wonder, is it a good idea to create a coin that leverages the power of the Desktops in this age of botnets ? FPGAs and certainly ASICs are more efficient in terms of power consumption (scrypt hashes/KWh), but they don't deliver a huge improvement over CPUs. So if most miners are parasitic, there will not be any commercial interest to invest in ASICs: even if they cost nothing to produce (they cost millions), a legit mining corporation will still have to pay for electricity. Meanwhile, the botmasters use free CPUs.

So a CPU-friendly mining algorithm is not at all a democratic one: it puts all the control over the currency in the hands of criminals, the only ones that have free unlimited CPU power.

MIDEX
▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ GET TOKENS ▂▂▂▂

Bobnova

Full Member

Offline

Activity: 210
Merit: 100

Re: Tenebrix scaling questions

October 02, 2011, 03:03:30 PM

If I had a botnet, I'd be pumped.

BTC: 1AURXf66t7pw65NwRiKukwPq1hLSiYLqbP

Lolcust

Member

Offline

Activity: 112
Merit: 11

Hillariously voracious

Re: Tenebrix scaling questions

October 02, 2011, 03:28:41 PM

The botnet issue is addressed in the Tenebrix faq and best discussed in main thread.

Let's keep this thread for discussing ASICs, FPGAs and other scary acronyms, as well as their possible impact on Tenebrix mining Wink

Geist Geld, the experimental cryptocurrency, is ready for yet another SolidCoin collapse Wink

Feed the Lolcust!
NMC: N6YQFkH9Gn9CTm4mpGwuLB5zLzqWTWFw67
BTC: 15F8xbgRBA1XZ4hmtdFDUasroa2A5rYg8M
GEG: gK5Lx6ypWgr69Gw9yGzE6dsA7kcuCRZRK

kjlimo

Legendary

Offline

Activity: 2114
Merit: 1031

Re: Tenebrix scaling questions

October 02, 2011, 04:06:25 PM

#10

where is the block chain stored for tenebrix? Is it the \tenebrix data\database directory which contains two log files so far?

Or is it all contained in the \tenebrix data\blk0001.dat file?

This file doesn't seem to be growing, how big is it when you're all caught up?

Coinbase for selling BTCs
Fold for spending BTCs
PM me with any questions on these sites/apps! http://www.montybitcoin.com

or Vircurex for trading alt cryptocurrencies like DOGEs
CoinNinja for exploring the blockchain.

Lolcust

Member

Offline

Activity: 112
Merit: 11

Hillariously voracious

Re: Tenebrix scaling questions

October 02, 2011, 08:52:30 PM

#11

Quote from: kjlimo on October 02, 2011, 04:06:25 PM

tenebrix_data is the folder where all the guts are.

Geist Geld, the experimental cryptocurrency, is ready for yet another SolidCoin collapse Wink

Feed the Lolcust!
NMC: N6YQFkH9Gn9CTm4mpGwuLB5zLzqWTWFw67
BTC: 15F8xbgRBA1XZ4hmtdFDUasroa2A5rYg8M
GEG: gK5Lx6ypWgr69Gw9yGzE6dsA7kcuCRZRK

bulanula

Hero Member

Offline

Activity: 518
Merit: 500

Re: Tenebrix scaling questions

October 07, 2011, 11:38:28 AM

#12

Please share scrypt GPU miner or it did not happen Tongue

Lolcust

Member

Offline

Activity: 112
Merit: 11

Hillariously voracious

Re: Tenebrix scaling questions

October 07, 2011, 11:46:30 AM

#13

I propose that SC2 GPU port and TBX gpu port be released as a single pak, so that miners can easily compare which is more lucrative to run on a GPU Wink

Geist Geld, the experimental cryptocurrency, is ready for yet another SolidCoin collapse Wink

Feed the Lolcust!
NMC: N6YQFkH9Gn9CTm4mpGwuLB5zLzqWTWFw67
BTC: 15F8xbgRBA1XZ4hmtdFDUasroa2A5rYg8M
GEG: gK5Lx6ypWgr69Gw9yGzE6dsA7kcuCRZRK

kano

Legendary

Offline

Activity: 4592
Merit: 1851

Linux since 1997 RedHat 4

Re: Tenebrix scaling questions

October 07, 2011, 11:47:26 AM

#14

Quote from: bulanula on October 07, 2011, 11:38:28 AM

Please share scrypt GPU miner or it did not happen Tongue

Hmm not sure why you repeated this ... it was funny the first time - but now just silly.

There is none.
Why would anybody even bother to spend the effort to make one?
A lot of effort to mine a soon to be dead or worthless block chain.

Are you a CH acolyte?
Has CH used scrypt in his (never?) to be released SC2.0 and someone wants to see a GPU Miner?

Lolcust

Member

Offline

Activity: 112
Merit: 11

Hillariously voracious

Re: Tenebrix scaling questions

October 07, 2011, 11:50:11 AM

#15

CH has claimed that he used something BETTER and STRONGER than scrypt (whether his algo also improves virility and firearm use performance is yet to be seen)

Geist Geld, the experimental cryptocurrency, is ready for yet another SolidCoin collapse Wink

Feed the Lolcust!
NMC: N6YQFkH9Gn9CTm4mpGwuLB5zLzqWTWFw67
BTC: 15F8xbgRBA1XZ4hmtdFDUasroa2A5rYg8M
GEG: gK5Lx6ypWgr69Gw9yGzE6dsA7kcuCRZRK

bulanula

Hero Member

Offline

Activity: 518
Merit: 500

Re: Tenebrix scaling questions

October 07, 2011, 07:50:28 PM

#16

Quote from: Lolcust on October 07, 2011, 11:50:11 AM

CH has claimed that he used something BETTER and STRONGER than scrypt (whether his algo also improves virility and firearm use performance is yet to be seen)

LOL nice one mate !

AnonyMint

Hero Member

Offline

Activity: 518
Merit: 521

Litecoin CPU-only (excluding GPUs & ASICs) mining possible? (probable solution)

August 15, 2013, 04:49:08 PM
Last edit: August 16, 2013, 03:04:25 PM by AnonyMint

#17

Quote from: BubbleBoy on September 27, 2011, 02:16:25 PM

...how hard it is to do the proof of work versus the cost of evaluating the proof. For example for hash cash the ratio is exceptionally good: it takes a single hash to check for a huge amount of work, up to 2^128 work for a perfect 256 bit hash...

...you make both the miners and those verifying their work compute a difficult variant of scrypt; each time a block is solved and advertised every node on the network must redo the work of the miner (in order to check it before broadcasting it as valid); this clearly does not scale, every node must have a processing power equal to the whole mining network...

Incorrect. All the miners are competing and computing scrypt numerous times trying to guess the correct input to get the required output. Whereas, those verifying that the provided input matches the required output, only have to compute scrypt once.

Quote from: ArtForz on September 28, 2011, 02:59:44 PM

Parameters used are N=1024,p=1,r=1 -> 128KB of V.
Not exactly much, but enough to give current gen GPUs a major headache.
Basically, instead of aiming for "enough memory use to hit external memory on any device", aiming for "small enough to fit in L2 cache on CPUs, big enough to require too much on-chip memory/cache for running enough parallel instances to make GPU/FPGA/VLSI worth it."

Quote from: ArtForz on September 29, 2011, 01:02:19 AM

So... maybe your GPU is 100 times faster than a CPU executing pure ALU code, but to do that you actually need to... like... store the per-thread state somewhere?
So, where to put 128KiB (or 32 KiB using a factor 4 time/space tradeoff) of memory for V *per thread*?

Registers? Way too small.
LDS - works, but you only got 64KiB/CU so you end up with a max of 2 threads/CU even with a factor 4 time/space tradeoff
GDS? way too small again.
L2 read cache? You got 512k of that, so with a /4 reduction you can keep 16 threads worth of V arrays there, but that's not really helping a whole lot due to reasons I'll explain shortly.
External GDDR5? Completely horrible for randomly reading 128 byte items all over the place...

...Best option so far is "use LDS, live with only being able to run 2 threads/CU (due to LDS size)... welll... thanks to what I explained above, that means you're effectively getting ALU throughput of half of one VLIW4 core per CU per clock.

So a whole 6970 ends up effectively performing as if you only had 12 fully utilized VLIW4 cores....

...So, if you have any constructive suggestions on what other approaches to try to wrangle better performance out of a AMD GPU for this variant of scrypt, I'm listening.

Given the roughly order-of-magnitude ratio in measured kHash/Watt efficiency we now see between say a HD 7970 and a (ll cores of a) Core i7 920, I assume your mistake was thinking the latency of reading random 128 byte chunks of GDDR5 would be a significant factor. Given the measured kHash ratio (~700 ÷ 37) of roughly 20 corresponds roughly to ratio in memory bandwidth between the two (except it appears to be a ratio of 1 if CPU is doing all accesses from L1 cache which is 20 times faster than the 25 GB/s), and the cgminer code is not employing a time/space tradeoff factor (i.e. V is 1024 x 32 x 4 bytes and is not computing some elements on the fly to shrink the buffer size), then it appears that both CPU and GPU are memory bound at the maximum memory bandwidth. Since the HD 7970 has 6GB of GDDR5, then 6 x 1024 x 1024 KB ÷ 128 KB = 48 x 1024 threads could be run to maximize memory throughput, because computation in other threads could continue while some of the threads were blocked on waiting for memory to load into local caches and registers. I assume the HD 7970 is either able to load only 128 bytes per random access and maintain full memory bandwidth or it is merging blocked threads in contiguous memory areas and loading their memory accesses synchronized.

In theory, cgminer could gain even more advantage than it has now by implementing a configurable time/space factor, because it is not thread bound (48 x 1024 is greater than 20) and the relative FLOPS between the HD 7970 and the Core i7 920 is 8192 ÷ 25 = 40 (which is greater than 20), thus the HD 7970 is not likely FLOPS bound, so computing some of V[i] instead of loading them from memory should increase kHash (yet I don't know if the incremental energy consumed for this incremental increase in kHash will be at the same efficiency).

Note the Core i7 920 only has 32KB of L1 data cache per core. So even if you are employing all 4 cores by running 4 instances, I don't think you are keeping all 128 KB of memory accesses within L1, unless it is memory bound in L1 and pooler's cpuminer is employing a time/space factor of 4. I fear at 128 KB instead of 32 KB, the CPU's cpuminer may be accumulating memory latency, whereas the GPU's cgminer may not be accumulating as much memory latency.

Quote from: ArtForz on September 28, 2011, 02:59:44 PM

Inb4 obvious, first thing I tried was the KISS "just allocate a huge buffer for V arrays in global memory, run 64-256 threads and have the mem and L2 controllers figure it out." approach. also with time/space tradeoff factors of 2 to 16. Ends up slower than going with LDS.

Wonder why that didn't work? Could it be that you were not associating each V buffer with its own thread ("a huge buffer" instead of 48 x 1024 x 128 KB buffers), thus the thread blocking interaction with memory loading wasn't optimized?

Quote from: ArtForz on September 28, 2011, 02:59:44 PM

If a future generation of GPUs has several MB of fast on-chip cache (which would make em competitive), a forced update with increased memory usage and parallelism is on the wall (e.g. N=4096, p=4 or 8 ).
And it'd still be "fast enough" to not cause too much of an issue with speed of verifying PoW, with N=4096,p=8 still only ~20ms per hash on a current decent CPU core.
Notice that N=4096 p=8 r=1 is one of the parameter sets used in the scrypt paper for relative cost on specialized hardware.

Based on my theory above, the settings to perhaps restore LTC to a CPU-only mined coin, would be to set N x r x p x 128 KB = 2GB (are you sure you meant p above and not r?), so that a GPU is limited to running a few threads. When thinking about the time cost of verifying (speed of computing) this hash, note it could be done once for the entire pool. Success of this strategy relies on the (apparently false by a factor 20, i.e. up to 500 ÷ 25 if cpuminer was keeping all accesses with L1 cache, which I doubt) assumption that memory bandwidth does not drop significantly (given more accesses will be from outside the caches), or that current cpu scrypt is FLOPs bound. Probably also want to maximize the number of threads p (up to the CPU's hardware threads) so as to minimize cumulative memory latency losses on the CPU (but then the GPU gets these additional threads too). Note even if the assumption above is false, reducing the GPU to just a few threads, might inhibit its ability to avoid cumulative memory latency losses (as compared to running up to 48 x 1024 threads, bcz each CPU core is more powerful than GPU core and the CPU is more optimized for lower # of hardware threads).

The other strategy to try is to reduce N x r x 128 KB to 32 KB to fit into L1 cache and see if that raises the speed of the cpuminer by a significant factor if the CPU is not FLOPS (compute) bound in the 256 KB L2 cache (actually 64KB since split between the 4 cores of the Core i7). More over the L1 cache speed needs to be faster than the memory bandwidth of the GPU and the CPU has to be L1 cache memory bound.

Also as an improvement over the original scrypt, run Salsa20 in step 2 of the scrypt ROMix and Salsa20/8 (or Chacha/4?) in step 3, so that time/space factors become less advantageous (even though cgminer is apparently not yet employing them). The hash in step2 should be made as slow as possible without step 2 becoming significant compared to step 3. To accomplish this and as another improvement over the original scrypt, I see no reason why the N in the for loop (not the mod N) couldn't be a larger value in step 3 than step 2. Note as Percival pointed out in his discussion about memory latency, memory bandwidth to FLOPS will continue to reduce, so the above adjustments either have to adjust to the future.

Because you (assumed you) kept CPU memory accesses within L1, you apparently did not concern yourself with the latency overhead that Colin Percival mentioned in section 6 of his scrypt paper. If N x r x p is increased, then r will need to be large enough to make latency insignificant, else the CPU would be at a disadvantage to the GPU and its apparently superior thread blocking memory controller (which exists because it has more hyperthreads to justify it).

The above two paragraphs insure the modified scrypt is memory bound against ASICs (and FPGAs) regardless of whether any improvement relative of CPU vs. GPU relative efficiency is achieved. Forcing ASICS to be memory bound means they are unlikely to significantly beat GPUs ever.

If the above ideas for staying within the 32KB cache on CPU (but not on GPU), or masking latency on the CPU with multiple threads but allowing insufficient threads on the GPU at 2+GB, does not sufficiently improve the CPU vs. GPU relative efficiency, then the other ideas to try are as follows. I am reasonably sure the first idea below will solve the problem!

Run Scrypt of an Scrypt, such that the inner Scrypt is operating on 32KB or 64KB blocks (to stay within L1 or L2 cache) and the outer Scrypt is pseudo-randomly picking one of those blocks from XXGB of blocks based on Integrify() of the result of the inner Scrypt. So this will virtually eliminate the memory latency and keep it running at cache memory bandwidth (if it is not compute-bound), yet the GPU will not be able to run as many threads because of the XXGB requirement.
GPU's GDDR5 RAM has up to (double an) order-of-magnitude more memory bandwidth than the CPU's DDR3 RAM (with upcoming CPU's DDR4 to double memory bandwidth), thus in large memory strategy (to prevent ASICs too) probably the only sure way to exclude GPUs is the CPU-only memory-bound scrypt could use more GDDR5 memory than any GPU will have-- which is currently 8GB for the upcoming Sony PS4 at 176 GB/s. GDDR5 is estimated to be double the cost of DDR3 (not including additional cooling and more expensive memory controllers and buses), so a CPU-only user would only need to spend less than $50 - 100 to upgrade to 8 - 16GB whereas there isn't much demand for a CPU with 8 -16GB of GDDR5, so even if was available it would probably be in the $400 - 550+ range (currently), i.e half (or more) of an order-of-magnitude. This would only run a 8+ thread on the GPU so in theory it would suffer latency slowdown relative to the CPU running 8+ threads (bcz each CPU core is more powerful than GPU core and the CPU is more optimized for lower # of hardware threads). Given the 18 - 24 months doubling in Moore's law, the ROI on the relative capital outlay on the GPU might outweigh any remaining efficiency advantage of the GPU. Stopping fools from wasting their (and society's) capital on expensive GPU debt-fueled obsessions would make for a more sane and less boom & bust economy.

The 8 - 16GB requirement would reduce the scale of Botnet attacks (since most hijacked computers would need to add DDR RAM). Although the average RAM in a new consumer-grade computer today is probably 6GB, not all the old computers have that, thus 8 - 16GB would favor those who do serious desktop computing (since they likely already have that much DDR RAM), which means we put more capital in the hands of programmers and those who are advancing technology (all human improvement is due to technology as the rest of society is always trying to get something for free via debt and socialism). They would run the miner on their desktop when they weren't working. This might increase demand for DDR RAM which has been declining with the decline of the desktop due to the rise of mobile devices, to help advance DDR4.

Probably want to scale the memory requirement by Moore's Law, but what the heck will IT professionals need 500 - 1600 GB of RAM for in a decade other than this? A peer-to-peer cloud (but wouldn't that need to be on always)? A cache of the web including a proliferation of apps and modules (which gets stored loaded off hard disk every day)?

The price per GB has been historically scaling at halving every 18 months between at least 1990 and 2010, i.e. 0.5 ^ (20 ÷ 1.5) = 1÷10,000 whereas 0.5 ^ (20 ÷ 2) = 1÷1024. The GB in a consumer-grade computer also roughly corresponds to the 10,000-to-1 ratio given the Commodore Amiga 600 for $500 in 1992 had 1MB (i.e. 1÷1024 GB) and Commodore 128D for $500 in 1987 had 128KB (i.e. 1÷8192 GB, 4.5 yrs is factor of 8 = 2 ^ (4.5÷1.5)). Note I had a an Atari ST 520 with 512KB computer in 1985, then I produced one of the world's first commercial consumer-grade wysiwyg word processors Word Up, so this reinforces my point that the technology movers will have more RAM than average. Before 1990, there may have been more demand for memory because it was far too minuscule and after 2010, now we may experience less demand due to the rise of mobile computing cannibalizing desktops.

GPU memory appears to be scaling at a doubling every 1.8 years (21 months). So if we scaled at this slower rate and started at say 8GB (above the average of roughly 6GB for a new computer), then eventually the average for computer would eventually catch up, and thus mining of the coin would become more inclusive over time.
Quote from: tacotime on March 06, 2012, 04:18:16 PM
There's also a tree search algorithm here which runs significantly faster with smaller data sets on CPUs/MICA versus GPUs; if it could be integrated into an encryption algorithm it might destroy GPU performance

www.webislands.net/pubs/FAST__SIGMOD10.pdf

For the non-Larrabee CPU architecture it is only 2X faster within L2 cache where it is compute-bound. But small memory is where ASICs (FPGA) may be able compete? And the GPU could run multiple instances in parallel to overcome this advantage. The vaporware Larrabee (AMD has a similar architecture purportedly planned for 2014) is faster than GPU in large memory, but only because the "high speedup over GPUs comes from the inefficiency of performing horizontal SIMD operations on GPUs". I wonder if GPUs could gain that capability by the time that new CPU architecture is released.

Quote from: ssvb on February 17, 2012, 09:35:28 AM

Right now salsa20/8 is used as a core part of of scrypt (8 rounds for the inner loop). Replacing it with something like salsa20/2 (just two rounds in the inner loop) would significantly improve performance on CPU, because 4x less calculations would be involved. And the memory access pattern would remain the same, resulting in almost no improvement for the miners which depend on memory performance (GPU miners and also to some extent Cell/BE miner). So what's the problem? The variants of salsa20 with lower number of rounds are supposedly significantly less secure, at least in some cases:

Indeed that would sacrifice the uniformity of the distribution in step 3 of ROMix, which might open an attack to (only compute some unique paths and thus) violate Percival's criteria that Integrify(H(x)) mod N is not significantly faster than H(x).

Indeed Percival mentioned in section 6 as one of his 5 criteria, that the hash (and my improvement only in step 3 which btw should also employ SIMD (SSE)) computation cycles should minimized.

~~unheresy.com - Prodigiously Elucidating the Profoundly Obtuse~~THIS FORUM ACCOUNT IS NO LONGER ACTIVE

AnonyMint

Hero Member

Offline

Activity: 518
Merit: 521

Re: Litecoin CPU-only (excluding GPUs & ASICs) mining possible? (probable solution)

August 16, 2013, 04:15:59 PM

#18

The idea for nested scrypt is to keep the duty-cycle of memory-bound to L1 cache near to 100% and it also provides the flexibility to control the duty-cycle of the main memory, i.e. 100% - percentage of the time that the algorithm is memory-bound in L1 cache while not writing out to the main memory. Lowering the latter duty-cycle will increase the execution-time of the hash, yet also lower the individual and society-wide electricity requirements of the coin, because for the same hardware less electricity is consumed.

~~unheresy.com - Prodigiously Elucidating the Profoundly Obtuse~~THIS FORUM ACCOUNT IS NO LONGER ACTIVE

AnonyMint

Hero Member

Offline

Activity: 518
Merit: 521

Re: Litecoin CPU-only (excluding GPUs & ASICs) mining possible? (probable solution)

August 18, 2013, 02:15:57 AM

#19

Elaboration:

https://bitcointalk.org/index.php?topic=267522.msg2955080#msg2955080

~~unheresy.com - Prodigiously Elucidating the Profoundly Obtuse~~THIS FORUM ACCOUNT IS NO LONGER ACTIVE

Pages: [1]

Bitcoin Forum > Alternate cryptocurrencies > Altcoin Discussion > Tenebrix scaling questions

« previous topic next topic »