Author Topic: Creating an altcoin that will never perform well on GPUs/FPGAs/ASICs  (Read 1815 times)
Midar (OP)
Newbie
Activity: 8
Merit: 0
May 25, 2014, 07:45:10 PM
 #1

Hi!

Being annoyed by the fact that all altcoins are at least mineable on a GPU (since my GPU sucks and my FPGA is quite low-end), I thought about creating an altcoin that will never perform well on a GPU, FPGA or ASIC. So I thought: "What is it that CPUs can do much better than GPUs?" and the answer of course is "Conditional branching!".

So my idea is basically this:
Many coins already chain multiple hash algorithms together, but they do so in an order that is known beforehand. This means it's no problem for a GPU, and even on an FPGA/ASIC you can just chain the hash units. Instead, I want to make the next hash algorithm dependent on the output of the previous hash function. This can be done by giving each hash function a number and selecting the next hash function with (hash % num_hashfn).
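A minimal sketch of what that loop could look like (hash_fns, NUM_HASHFN and NUM_ROUNDS are illustrative names, not a final spec):

Code:
#include <stdint.h>
#include <string.h>

typedef void (*hash_fn)(const uint8_t *in, size_t len, uint8_t *out);

#define NUM_HASHFN 10
#define NUM_ROUNDS 10
extern hash_fn hash_fns[NUM_HASHFN];  // one implementation per algorithm

void chained_hash(const uint8_t seed[64], uint64_t prev_block_hash,
                  uint8_t state[64])
{
  memcpy(state, seed, 64);
  // the first function is seeded by the last block's hash, as proposed
  unsigned next = (unsigned)(prev_block_hash % NUM_HASHFN);
  for (int round = 0; round < NUM_ROUNDS; ++round) {
    hash_fns[next](state, 64, state);  // run the selected hash in place
    next = state[0] % NUM_HASHFN;      // its output picks the next one
  }
}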

So why is this hard for a GPU/FPGA/ASIC? Let's start with the GPU:
GPUs are good at processing lots of data using the same operations, but as soon as you need to branch, their performance is laughable. Since the next algorithm to use is different for each block you try (except for the first hash function, which should depend on something like the hash of the last block in the block chain), you are not applying the same operations to a lot of data - you are applying different operations to a lot of data and need to look at the output of the last hash to determine which code to execute next. GPUs are so bad at this that it's commonly recommended to do 10 calculations at once, check at the end which one you need, and throw away the other 9.
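For illustration, a GPU kernel would likely end up with exactly that compute-everything-then-select pattern (reusing the illustrative names from the sketch above):

Code:
// sel differs from lane to lane, so to keep all lanes in lockstep every
// lane computes all NUM_HASHFN hashes and keeps only the one it needs
unsigned sel = state[0] % NUM_HASHFN;      // this lane's required function
uint8_t candidates[NUM_HASHFN][64];
for (int f = 0; f < NUM_HASHFN; ++f)
  hash_fns[f](state, 64, candidates[f]);   // 10x the work per round
memcpy(state, candidates[sel], 64);        // discard the other 9 results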

OK, so this is hard on GPUs. But why is it hard for FPGAs/ASICs? Well, there are two options for implementing this on an FPGA/ASIC. Option A is to use a state machine and perform the hash selected by the state; at the end, you set the state to the next hash function to use. This requires a flip flop to store the state, and a flip flop needs one cycle to change. So you can only do one hash per cycle, which means you would need a really high clock rate. Affordable FPGAs and ASICs usually have quite low clock rates, so this would slow them down significantly.
Option B is to use a lot of muxes and demuxes. The output of (hash % num_hashfn) can be fed to a number of demuxes which route the wires to the correct hash function module, and at the end of the circuit a mux selects the output of the right hash function. However, that would make the circuit extremely complex, so it would fit only a few times on an FPGA and thus allow only very limited parallelism. The same is true for an ASIC, which would need an extremely large die.

As you can see, this algorithm works quite well on a CPU, which runs at many GHz and is good at branching (well, compared to the others, that is), but it is slow on a GPU and either slow or very expensive on an FPGA/ASIC.

So what do people think of this? Is this something people are interested in and I should implement? Any feedback on the design?

// Edit: Possible names for this would be BranchCoin or JumpCoin ;).
dewdeded
Legendary
Activity: 1232
Merit: 1011
Monero Evangelist
May 25, 2014, 09:47:19 PM
 #2

No one wants CPU-only coins, because they would favor botnet owners or credit card fraud artists renting thousands of server instances.
Midar (OP)
Newbie
Activity: 8
Merit: 0
May 25, 2014, 09:50:25 PM
 #3

Well, and GPU-only coins favor gamers, FPGA coins favor hardware tinkerers, and ASIC coins favor chip makers (they're the ones making the main profit). So what? At least CPU-only coins can be mined by anyone. CPU-only coins can be mined on servers that are mostly idle. CPU-only coins could actually use energy for mining that is otherwise wasted on idling.
TookDk
Legendary
Activity: 1960
Merit: 1062
One coin to rule them all
May 25, 2014, 10:01:48 PM
 #4

Quote
Option B is to use a lot of muxes and demuxes. The output of (hash % num_hashfn) can be fed to a number of demuxes which route the wires to the correct hash function module, and at the end of the circuit a mux selects the output of the right hash function. However, that would make the circuit extremely complex, so it would fit only a few times on an FPGA and thus allow only very limited parallelism. The same is true for an ASIC, which would need an extremely large die.

You would not need to demux the input to your "hash array"; all hash units could hash the same input, and a mux at the end would select the correct output.

How many different hash algorithms do you have in mind (approximately)?

dewdeded
Legendary
Activity: 1232
Merit: 1011
Monero Evangelist
May 25, 2014, 10:03:11 PM
 #5

Quote
Well, and GPU-only coins favor gamers, FPGA coins favor hardware tinkerers, and ASIC coins favor chip makers (they're the ones making the main profit). So what? At least CPU-only coins can be mined by anyone. CPU-only coins can be mined on servers that are mostly idle. CPU-only coins could actually use energy for mining that is otherwise wasted on idling.
I respect your opinion. But I just wanted to tell you that >95% of the community think otherwise.
Midar (OP)
Newbie
Activity: 8
Merit: 0
May 25, 2014, 10:06:11 PM
 #6

Quote
You would not need to demux the input to your "hash array"; all hash units could hash the same input, and a mux at the end would select the correct output.

Well, that means you would need even more hashing units. Without the demux, a lot of hash units would just sit unused, so you would be even less efficient.

Quote
How many different hash algorithms do you have in mind (approximately)?

That is something that would be easy to adjust and should be decided shortly before the initial release; e.g. it would make sense to try out how many are enough to prevent GPU/FPGA/ASIC mining. I'm thinking around 10 for now.

Quote
I respect your opinion. But I just wanted to tell you that >95% of the community think otherwise.
Thanks for your input! Duly noted :).
TookDk
Legendary
Activity: 1960
Merit: 1062
One coin to rule them all
May 25, 2014, 10:15:39 PM
 #7

Quote
You would not need to demux the input to your "hash array"; all hash units could hash the same input, and a mux at the end would select the correct output.
Quote
Well, that means you would need even more hashing units. Without the demux, a lot of hash units would just sit unused, so you would be even less efficient.


That is really not a problem: having e.g. 10 hash units run in parallel where 9 are idle at any time would still greatly outperform any CPU in speed (in 1:1 die technology).
You can even save power on the 9 unused units by controlling their clock enables.

I think you would need 100-200 different (very different) hash algorithms before an ASIC/FPGA would "be in trouble".

DeathAndTaxes
Donator
Legendary
Activity: 1218
Merit: 1079
Gerald Davis
May 25, 2014, 10:21:32 PM
 #8

Quote
Well, and GPU-only coins favor ~~gamers~~ botnet operators,

The average botnet has 10,000x the computing power of the average gamer. 
iopq
Hero Member
Activity: 658
Merit: 500
May 25, 2014, 10:24:00 PM
 #9

Quote
Well, and GPU-only coins favor ~~gamers~~ botnet operators,
Quote
The average botnet has 10,000x the computing power of the average gamer.
Most botnet computers don't have OpenCL installed.
Midar (OP)
Newbie
Activity: 8
Merit: 0
May 25, 2014, 10:29:07 PM
 #10

Quote
That is really not a problem: having e.g. 10 hash units run in parallel where 9 are idle at any time would still greatly outperform any CPU in speed (in 1:1 die technology). You can even save power on the 9 unused units by controlling their clock enables.

The problem is that 9/10 idle means it needs a *lot* of space on the die in order to get decent speeds. That will make the die *very* expensive to manufacture and thus make it unprofitable.

Quote
I think you would need 100-200 different (very different) hash algorithms before an ASIC/FPGA would "be in trouble".

Why very different? As long as they don't have a reusable update phase, it should be OK.

Well, 100 means 99/100 of the space on the die would be unused almost all the time. The costs should be really high then. I guess even for 10 the costs would be quite high, considering how many units you would need in parallel to reach CPU speeds at low clock rates. But yes, 100 would be a number that should make very sure that an ASIC is unprofitable and that not enough parallel hashing can be done even on big FPGAs :).

Quote
Most botnet computers don't have OpenCL installed.
A botnet operator can do anything with the computer. That includes installing OpenCL.
TookDk
Legendary
Activity: 1960
Merit: 1062
One coin to rule them all
May 25, 2014, 10:39:05 PM
 #11

Quote
The problem is that 9/10 idle means it needs a *lot* of space on the die in order to get decent speeds. That will make the die *very* expensive to manufacture and thus make it unprofitable.

The topic was whether it would perform well. It would... Whether it is cost-efficient to develop and produce will depend on the exchange rate of the "jumpcoin".

Midar (OP)
Newbie
Activity: 8
Merit: 0
May 25, 2014, 10:42:08 PM
 #12

Quote
The topic was whether it would perform well. It would... Whether it is cost-efficient to develop and produce will depend on the exchange rate of the "jumpcoin".

Well, as outlined in the initial post, option A would not perform and option B would be too expensive. Sure, an ASIC would beat any CPU if you ignore the cost and go for option B. The idea is independent of the exchange rate: if it would be more expensive than CPUs with the same performance, why do it? The power savings would be quite minor with so much unused yet powered logic (and powering units on and off takes a clock cycle, bringing you back to the performance bottleneck of option A).
DeathAndTaxes
Donator
Legendary
Activity: 1218
Merit: 1079
Gerald Davis
May 25, 2014, 11:01:16 PM
 #13

Quote
Well, and GPU-only coins favor ~~gamers~~ botnet operators,
Quote
The average botnet has 10,000x the computing power of the average gamer.
Quote
Most botnet computers don't have OpenCL installed.

The whole point of a botnet is that the botnet has control over the zombie nodes.  Whatever is needed to accomplish the goals of the botnet will be installed.
hulk
Sr. Member
Activity: 392
Merit: 250
May 26, 2014, 12:25:07 AM
 #14

That would create botnets, as others mention. I suggest creating a pure GPU coin instead (truly anti-FPGA and anti-ASIC).

tokyoghetto
Legendary
Activity: 1232
Merit: 1000
May 26, 2014, 12:35:13 AM
 #15

Doesn't Myriad solve this whole "what hardware to use" problem by letting everyone mine?
Midar (OP)
Newbie
Activity: 8
Merit: 0
May 26, 2014, 12:41:31 AM
 #16

Quote
Doesn't Myriad solve this whole "what hardware to use" problem by letting everyone mine?
Well, it basically means that people who bought hybrid ASICs that can do SHA-256 and Scrypt at the same time have an advantage, doesn't it? ;)
optimiz3
Newbie
Activity: 37
Merit: 0
May 26, 2014, 03:51:21 AM
Last edit: May 26, 2014, 07:10:59 AM by optimiz3
 #17

OP is on the right track.

What I can say is this:

1. X11 is trivial to obliterate via FPGAs and ASICs, and will be obliterated in the near future if X11 coins continue to appreciate.  X11 only requires compute resources, which are cheapest as far as die area goes when dealing with FPGAs and ASICs.  A good rule of thumb is 32 bits of logic roughly equals 32 bits of RAM.  X11 plays towards the strengths of FPGAs and ASICs because it requires no RAM (and minimally, only flip flops for pipeline state management).

2. Designs that focus on branching also play towards the strengths of ASICs and FPGAs.  ASICs and FPGAs evaluate branches in parallel.  Choosing the next hash function to execute based on the current state of a hash doesn't impact performance, since you can saturate all hardware branches by temporarily "saving" work in the pipeline and then "releasing" work so that each subsequent hash unit is always fully occupied.  This is easier than it seems because cryptographically secure hash functions are equally likely to take each branch (otherwise it wouldn't be cryptographically sound as it would be biased, which would lower entropy).

3. When engineering ASICs, going from most expensive to least: RAM > I/O bandwidth* > Compute
  * I/O bandwidth is really a proxy for RAM, think DDR.



ANY design that relies on difficulty to code rather than cost to manufacture WILL fall to ASICs/FPGAs (e.g. X11).



Recommendations:

Scrypt was a good start. Scrypt does the following things well:

1. Focuses on the economics of ASICs and FPGAs (i.e. RAM > Compute)
2. Amplifies RAM cost with cryptographic operations between RAM accesses.  Requiring 64 clock cycles of Salsa20/8 between memory accesses amplifies memory requirements since the memory is tied up until the hash calculation completes.
3. Uses unpredictable state transitions to determine where to read next (sketched below).
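Put together, the read loop of scrypt's ROMix stage looks roughly like this (a paraphrase; integerify, xor_blocks and salsa20_8 are assumed helper names):

Code:
for (i = 0; i < 1024; ++i)
{
  j = integerify(state) % 1024;       // 3: read address depends on state
  state = xor_blocks(state, RAM[j]);  // 1: the RAM has to really exist
  state = salsa20_8(state);           // 2: ~64 clocks of work per access
}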


What Scrypt didn't do well and led to its downfall:

1. Used predictable state transitions to determine where to write next.  Because memory was filled in order from 0-1023, it enabled a compute/RAM tradeoff where you could skip every other memory access and compute it instead.  I.e. you could store values 0,2,4,...,1022 (ignoring 1,3,5,...,1023), and then if something tried to read position 5, look up 4 and get 5 by running 4 through Salsa20/8 once.  This significantly weakened the memory cost (this is what the --lookup-gap GPU miner setting does; see the sketch after this list).

2. Used too few cycles between RAM accesses.  Salsa20/8 ties up memory for ~64 clocks per memory access.  Increasing the number of rounds would increase memory requirements without impacting cache on CPU miners.
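To make the lookup-gap tradeoff from point 1 concrete, here is a sketch with a gap of 2 (state_t, v_half and salsa20_8 are assumed names; only even-numbered scratchpad entries are stored):

Code:
state_t lookup(const state_t *v_half, unsigned j)
{
  state_t s = v_half[j / 2];  // the stored entry at the even index below j
  if (j & 1)
    s = salsa20_8(s);         // one extra hash recovers the odd entry
  return s;                   // half the memory for a little extra compute
}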

How to implement a serious CPU-only miner:

1. Don't do something silly like X11 which only adds implementation complexity and no real ASIC cost complexity.

2. Fix predictable memory writing in Scrypt - right now Scrypt looks like:

Code:
for (i = 0; i < 1024; ++i)
{
  RAM[i] = state;            // memory access is sequential and predictable
  state = salsa20_8(state);  // Salsa20/8
}

// BETTER: fill unpredictably
Code:
for (i = 0; i < 1024; ++i)
{
  RAM.Insert(state.e % (i + 1), state);  // insert position is unpredictable
  state = salsa20_8(state);              // (i + 1 avoids a modulus by zero)
}

The one-line change above causes computational problems that we have no good solution for.  It adds a requirement for O(1) memory access and O(1) memory insert.  The best-known solutions for these requirements are hashtables, but hashtables work best when there are no collisions and keys are immutable.

RAM.Insert() creates a worst case for hashtables because each insert causes a collision and modifies the keys of all items after the insert position.  Alternate data structures offer even worse amortized performance (for example, trees are O(log(N)) memory access and O(log(N)) memory insert).
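For intuition, with a plain array that Insert costs O(N) per call, since everything at or after the insert position has to shift (an illustrative sketch; assumes <string.h> and the state_t name from above):

Code:
void insert(state_t *ram, int *len, int pos, state_t v)
{
  // shift every element at or after pos up one slot: O(N) moves per insert
  memmove(&ram[pos + 1], &ram[pos], (size_t)(*len - pos) * sizeof(state_t));
  ram[pos] = v;
  (*len)++;
}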

Also, we are performing a modulus with a variable (incrementing) denominator, which can get expensive in hardware.

3. Switch from Salsa20/8 to something like SHA-256 between Scrypt memory accesses.  As mentioned in Scrypt issue #2 above, increasing clocks between memory accesses increases FPGA/ASIC memory requirements without impacting CPU cache requirements.  Also, SHA-256 is convenient because it uses 32-bit internal registers, which democratizes it to x86 and ARM devices.  (A sketch follows below.)
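In the read loop sketched earlier, point 3 would amount to something like this (sha256 as an assumed helper; ROUNDS_BETWEEN_READS is a made-up, tunable knob):

Code:
for (i = 0; i < 1024; ++i)
{
  j = integerify(state) % 1024;
  state = xor_blocks(state, RAM[j]);
  for (r = 0; r < ROUNDS_BETWEEN_READS; ++r)
    state = sha256(state);   // more clocks per access, same CPU cache use
}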

4. Consider floating-point crypto functions; FP operations are very hard to do on FPGAs/ASICs because they require a significant amount of R&D or IP.  If you coupled FP-crypto with the above, you'd be well on your way.


Also - I'm not saying everything here is perfect, but I'd love to have a serious discussion if there is interest.  Finally, I wrote this long post quickly, so please excuse any grammar errors/typos.

EDIT: clarity :).
arcke
Newbie
Activity: 46
Merit: 0
May 26, 2014, 05:57:21 AM
 #18

I would look forward to seeing a truly ASIC/FPGA-defeating algorithm emerge. The majority of the current mining population probably does not care about such a development, but there are still good reasons to do this. We would stop losing mining efficiency to people using specialized hardware, which is expensive. Botnets, however nasty, are still very cheap and possibly accessible to many, and they would still be competing with cloud miners for a possible 51% attack. This could be an important step towards creating a coin which does not favor the big players and does not create a huge barrier for an individual to start mining on basically any machine.
Midar (OP)
Newbie
Activity: 8
Merit: 0
May 26, 2014, 01:19:28 PM
 #19

Quote
A good rule of thumb is 32 bits of logic roughly equals 32 bits of RAM.
What is 32 bits of logic for you? Is one XOR with 32 flip flops 64 bits of logic (storing 32 bits, XORing 32 bits)? That's a really weird "unit" for the size of logic that I've never seen used before.

Quote
X11 plays towards the strengths of FPGAs and ASICs because it requires no RAM (and minimally, only flip flops for pipeline state management).
Flip flops already make it slow, as they need one cycle to change. FPGAs and ASICs usually have slow clock rates, so you want to keep the number of cycles low. That's why implementing a CPU on an FPGA is dead slow compared to a real CPU. My proposal makes use of this, and it is actually option A I mentioned, where you need to use flip flops and need M cycles instead of 1, where M is the number of hashing functions that you chain.

Quote
Designs that focus on branching also play towards the strengths of ASICs and FPGAs.  ASICs and FPGAs evaluate branches in parallel.
It's not that easy. There are two options; see above for option A and option B. One option is slow, the other extremely expensive, because you then need a *lot* of unused logic. And option B is not just expensive: it will also violate timing constraints if the path gets too long, meaning it's not possible to program an FPGA that way / build such an ASIC. So this already is a pretty good defense against FPGAs and ASICs. Only a processor manufacturer like Intel could make an ASIC that does this better than a CPU at acceptable cost.

Quote
Choosing the next hash function to execute based on the current state of a hash doesn't impact performance, since you can saturate all hardware branches by temporarily "saving" work in the pipeline and then "releasing" work so that each subsequent hash unit is always fully occupied.
Uhm, there are no branches in an FPGA/ASIC. You can implement a CPU-like state machine to allow "branches", that's it. In this case, that would even be overkill: a state machine would already be enough. But that would need a flip flop, significantly slowing down performance.

Quote
This is easier than it seems because cryptographically secure hash functions are equally likely to take each branch (otherwise it wouldn't be cryptographically sound as it would be biased, which would lower entropy).
Actually, that's what makes it harder to implement on an FPGA/ASIC/GPU: it is not predictable. Therefore you have to attach a hashing unit for each possible hash to the output of the one before, and then do the same for each of those outputs again, meaning the circuit grows exponentially. So, 10 hashing functions with 10 rounds: 10^10 units already! And only 10 of them are ever used! That really does not perform well, performance-wise *and* power-consumption-wise.

Quote
When engineering ASICs, going from most expensive to least: RAM > I/O bandwidth* > Compute
  * I/O bandwidth is really a proxy for RAM, think DDR.
Well, this is neither. This is logic used on the die, in other words: the number of transistors. Option B makes the number of transistors so high that the efficiency is a joke. Option A makes it so slow that a CPU will be faster.

Quote
ANY design that relies on difficulty to code rather than cost to manufacture WILL fall to ASICs/FPGAs (e.g. X11).
That's just not true. There is a limit to how many transistors you can place on a chip. If that's reached, you have to fall back to doing things serially. Considering that you need slow clock rates in order to be energy efficient, that will make it slow pretty quickly.

Quote
Don't do something silly like X11 which only adds implementation complexity and no real ASIC cost complexity.
As shown above, the non-deterministic part of my proposal adds *real* ASIC cost complexity: 10^10 hashing units for calculating just one block, or 10 hashing units for one block that needs 10 cycles.

However, I do agree that introducing memory into all this would make it even harder. But it would also make it harder for CPUs; memory is the slowest thing in modern computers.
optimiz3
Newbie
Activity: 37
Merit: 0
May 26, 2014, 02:40:23 PM
Last edit: May 26, 2014, 02:56:38 PM by optimiz3
 #20

Here's a good starting point when thinking about the relative costs of gates vs flip flops on ASICs:

http://www.utdallas.edu/~mxl095420/EE6306/Final%20project/tsmc18_component.pdf

While it is only for TSMC 180nm, it is useful for napkin analyses as the relative cell sizes are similar to more advanced process nodes which tend to require NDAs.

XOR2X1 (1-bit XOR, drive strength 1) is 5.0 x 5.3 um; DFFX1 is 5.0 x 11.2 um.  So actually ~2 gates of compute for every bit of memory (which reinforces the trade-off even more).

Also, rotates have no cost as they are just wiring, and XORs are one of the larger logic cells - AND2X1 is 5.0 x 2.6 um, so ~4 gates for every bit of memory.  N-bit adders are slightly more tricky, as they tend to have N log N gates in unoptimized implementations.



If hypothetically you were presented with the following proof-of-work construction:

Code:
hash_fn HashFns[10];  // each is a fn ptr to a unique hash fn
for (int i = 0; i < 10; ++i) {
  state = HashFns[state % 10](state);  // hash output selects the next fn
}

You would likely implement it on an ASIC/FPGA with a number of cores proportionate to the size of each hash function.  Each core would have a FIFO in front.  All branching would do is have you "enqueue" the output of each hash function into the FIFO belonging to the core of the next hash function.  This way all cores would be kept occupied, and no core would need to go idle.  Other than the first few clocks, no work or silicon would be wasted.  So: 10 cores + 10 FIFOs.
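A rough software model of that routing (dataflow only, not HDL; fifo_t, fifo_pop, fifo_push and check_result are assumed helpers):

Code:
typedef struct { uint64_t state; int round; } work_t;

fifo_t queues[10];                // one FIFO in front of each hash core

void core_step(int f)             // advance the core for hash fn f by one
{
  work_t w;
  if (!fifo_pop(&queues[f], &w))
    return;                       // nothing queued for this core yet
  w.state = HashFns[f](w.state);  // this core only ever computes hash f
  if (++w.round < 10)
    fifo_push(&queues[w.state % 10], &w);  // route to the next core's FIFO
  else
    check_result(&w);             // all 10 rounds done: check vs target
}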

This is what I meant by ASICs/FPGAs executing branches in parallel.  Well designed circuits always break up algorithms into separate steps/execution stages, which are then pipelined so each execution stage is kept occupied.


The implementation above is neither option A nor option B.  It exploits the lack of RAM usage to maximize compute area.  I agree that if options A and B were the only ways of implementing the design then yes, branching would work, but professional designers would not use options A or B, because they are inefficient, as you call out :).


Quote
Quote
ANY design that relies on difficulty to code rather than cost to manufacture WILL fall to ASICs/FPGAs (e.g. X11).
That's just not true. There is a limit to how many transistors you can place on a chip. If that's reached, you have to fall back to doing things serially. Considering that you need slow clock rates in order to be energy efficient, that will make it slow pretty quickly.

We're not disagreeing on the transistor limit.  I'm saying that when designing ASICs/FPGAs, any time you hit serial execution you just pipeline.  This is exactly how SHA256d ASICs work (128 serial rounds of SHA in SHA256d).


Separately - the reason cryptographic ASICs tend to run at lower clock rates is efficiency, not any speed limitation of the flip flops.  Take the (obsolete) 180nm cells above: a DFF (D flip-flop) has a max propagation delay of 0.28 ns, and 1 s / 0.28 ns ≈ 3.57 GHz.  Clearly there must be another reason for lower clock speeds.

Clock signal propagation is one of the most power-expensive events on an ASIC (a pulse that has to propagate through the entire chip).  Lowering the clock speed lets you implement more stages per clock, since the signal has more time to propagate through gates (more time = lower drive voltage, fewer CMOS signal transitions).  It is a balance, because after too many gates the output becomes unstable due to asymmetries in most algorithms.  Another bonus is that if you can do 2x the work per clock, you need fewer flip flops (due to fewer stages, where the FFs sit at the beginning of each stage to capture state), which means less die area and power.

This is why most SHA256 ASICs perform at least 2 SHA rounds per clock/pipeline stage.