Bitcoin Forum
Author Topic: Hummingbird Proof of Work - a new design with a pretty diagram to explain it  (Read 149 times)
lokiverloren (OP)
Newbie | Activity: 85 | Merit: 0
February 25, 2018, 05:37:22 AM
#1

First, the link: it is a diagram stored on my Google Drive and it covers pretty much everything, before I present some spiel about it:

https://drive.google.com/file/d/1kq9_b-n2n-5ZRYviX1YfULO7h2arSoaW/view?usp=sharing

I developed this after examining Cuckoo Cycle and beginning an attempt to implement it in Go. There is already a modified version written in Go, but it uses much shorter edge bit sizes and cycle lengths (and unfortunately does not build because of the amd64 siphash assembler code included in it: https://github.com/AidosKuneen/cuckoo )

Briefly, my assessment of the algorithm is that it is overly complex and that its solutions take up a lot of space; they could easily be compressed by changing the algorithm, which is exactly what Hummingbird does.

Instead of requiring the actual solution coordinates to be listed explicitly in the solution, a Hummingbird solution contains the nonce plus a list of positions in the hash chain (hashes of hashes of hashes, and so on), each encoded as a 16-bit index. Each index refers back to 1+4x32-bit hashes: four 32-bit values for the heads and tails of the vectors, plus the chain hash itself.

Since this algorithm uses 32 bits for each coordinate in an adjacency-list graph, it is likely that, for a given compute capacity of the mining network, very long cycles will not be needed to hit the block rate target.

In my analysis, Cuckoo Cycle is in principle a type of hash collision search: rather than searching for values below a maximum magnitude, it searches for a chain of hash collisions in which no value occurs more than twice before the target cycle length is reached. The way John Tromp designed it is overly complex, in my view, probably owing to his extensive background writing hashcash-style algorithms in CUDA and OpenCL. To be precise, his algorithm looks for subsets of hashes that collide, and requires a lot of bitwise operations which, while not expensive, can be dispensed with entirely by using a scheme like the one I have designed.

Essentially, what I have done is make each graph coordinate set its own individual hash result, and I picked Murmur3 for its speed. It is possible that a shorter hash length would make for longer cycles, and perhaps more precision in hitting block rate targets. At the same time, the solver algorithm (see the link above) pops out the solution in fairly linear time, proportional to the total computation and memory access required to find it. Unlike Cuckoo, which has to generate the hash chain, sort it, and then search it, Hummingbird uses seven separate slices to search and tabulate as each edge is generated, so the minimum time may be shorter in some cases and longer in others.

Hash algorithms by their nature produce uneven collision patterns depending on the data, so some solutions will take less time than others. In theory, given an arbitrary amount of time and data to sort through, every nonce can eventually be found to yield cycles of all kinds of lengths, but this is capped at an array of 128-bit entries with a 16-bit index, yielding a maximum of a smidge over 4 GB of memory for the hash chain in the worst case, plus 128 bits per candidate during search (another 4 GB), and again another 4 GB in the worst case for the solution/reject slice arrays.

So, in theory, the algorithm tops out at around 12 GB of memory utilisation. That is part of my intention here: to create an algorithm that requires intensive random access to more memory than a current-generation, economically viable video card carries. When this ceases to be the case (perhaps in another 3-5 years), it can simply be expanded to use 32-bit array indexes in the solution, enabling a much larger search space and longer cycles; longer hash lengths (64-bit or 128-bit) will also raise the worst-case memory utilisation.

There is a small possibility that the specific hash used may have potential attacks (well, optimisations) that accelerate the discovery of hash collisions. The hash chain requirement makes this harder: while you can find hash collisions in general, the finite field reachable through a hash chain is distinct from the total space, a subset of the total hash collision field. If an attack is found on the specific hash algorithm, it would not be difficult to switch to another one, and that would set attackers back a long way, especially if the new hash uses an entirely different category of hashing technique from the one that was compromised.

I suppose using hash chain fields as the search space will also further weaken quantum attacks as an alternative way to accelerate the search, should qubit-based computation ever become anywhere near competitive with silicon.
tromp
Legendary | Activity: 978 | Merit: 1080
March 18, 2018, 09:32:28 AM
#2

I'm not sure I understand what Hummingbird solutions look like.
Are they just cycles x -> hash(x) -> hash(hash(x)) -> ... -> x of a fixed length?
If so, then you wouldn't need any memory to find them.

Perhaps you could give a Hummingbird specification like I give for Cuckoo Cycle at
https://github.com/tromp/cuckoo/blob/master/doc/spec
that shows what solutions will be accepted in a verify() function...
lokiverloren (OP)
Newbie | Activity: 85 | Merit: 0
May 13, 2018, 08:34:02 PM
#3

Cuckoo Cycle has a maximum of 8192 hashchain elements in a search. It uses a bucket sort to order them so that partial collisions can be found, and as a result the time taken per solution is fairly uniform.

Hummingbird takes a different approach. The chains are much longer and the number space much wider (at this point I think it will be adjacency lists of 64 bits, with 32 bits per head/tail pair), and it is designed to consume a much larger amount of memory (upwards of 12 GB). I am writing a binary tree algorithm based on the b-heap used in Varnish, except the sorting is lateral rather than vertical; it follows the same general principle and is designed to be very data-cache-local, so that much of the searching and sorting takes place inside the cache, allowing enough time for memory retrieval. By doing this it bypasses much of the advantage of GPU and potential ASIC acceleration, since most of the action happens inside the CPU cache (and I have been reading about the Ryzen Threadripper 1920X... so much want!).

So, solutions will be found more progressively by the solver. Because of the Poisson distribution produced by any decent hash algorithm (I will be using HighwayHash), the difficulty adjustment will not target number sizes but instead the number of elements in a cycle, encoded as the scalar of each element's hashchain distance from the seed nonce. This means verification is a little more work than in Cuckoo, Momentum, or Primecoin's algorithm, but again, because of HighwayHash, so long as nodes have AVX2 it should not be onerous. The work lies in searching the hashchain sequence and in accessing a lot of memory, but with the tree algorithm, latency delay is not a large component of what makes it 'work'; rather, it is having to explore such a large number space (likely in the range of 16-24 bits of address space to encode the scalars of the hashchain elements).

Because such a large number field is being searched, any strategy to accelerate the search would require memory faster than the tree algorithm can sort elements within the cache, which is basically impossible over an external front-side bus; in fact the fastest hardware to solve it would be a CPU with a bigger cache. That is not a type of chip that small ASIC fabs can manufacture; only CPU makers have the equipment and economies of scale to print that much fast memory into a processor. Possibly actual memory manufacturers could build them, but the point is that the tree algorithm and the size of the hashchains preclude any advantage over commodity DDR4+ memory and CPUs, to an even lesser degree than is the case with Cryptonote and Equihash: in terms of the economics, the cost of producing very large on-die SRAM alongside a processor cannot really be improved over what Intel and AMD are already doing. Furthermore, the tree algorithm I have designed should make it difficult to find significant accelerations of the process.

Of course this is all hypothetical, and I am certainly not claiming I have understood everything correctly, but it makes sense in my mind: if you can shift most of the processing of a search inside the cache, so the process is rarely waiting for memory to arrive, you reduce the load on the memory bus compared with a search/sort/insert/delete that constantly waits on memory, and that is something you simply cannot do without substantially sized caches. The access pattern is no more deterministic, but walking the tree only requires pulling 3-4 memory pages per walk with the tree I have designed, compared with a brute-force sort, which constantly overflows the cache and moves the bottleneck to the front-side bus.

If it works as I hope, it may also make mmapped files viable, with swap on fast flash storage, and reduce the differential between small, low-power hardware and larger systems with lots of DRAM attached, in terms of cost per unit of solution capability (a 64-bit ARM processor has 4 MB of cache versus an Intel i5's 6 MB, so if cache performance becomes the critical factor in solution discovery, these much cheaper chips may work out close enough in capital outlay and power consumption to produce a relatively uniform ratio). Or in simpler words: throwing more money at the hardware will not significantly increase the output of solutions, which is the central focus of ASIC development.