Bitcoin Forum
October 18, 2017, 06:05:46 AM
 News: Latest stable version of Bitcoin Core: 0.15.0.1  [Torrent]. (New!)
 Home Help Search Donate Login Register
 Pages: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 [23] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
 Author Topic: Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013)  (Read 425841 times)
fpgaminer
Hero Member

Offline

Activity: 560

 July 27, 2011, 04:57:26 AM

Quote
The other thing I've been told is your FPGA should never really exceed 60-70% usage pre-routing.... because a lot of resources are needed to get high-speed routing done...  And trying to pack 2 engines in there is likely nearing more like 90% usage...
Probably, but a lot of the design gets optimized in routing so it's hard to actually use less pre-routing

Quote
For example, what are the different length shift registers for?
Here is a quick overview of SHA-256, which will be helpful to explain those shift registers. There are 4 steps to the SHA-256 algorithm. The Pseudo Code on that wiki page is very helpful.

Step 1) Pre-processing. This is done for us already and is not handled by the FPGA (it's done by bitcoind when you perform a getwork request). Just ignore it; it isn't important.
Step 2) W calculation. The W array is 64 32-bit words, and is calculated from the 512-bit input data. The first 16 words are a 1-to-1 copy of the 512-bits of data. The remaining words are calculated like so: w := w[i-16] + s0(w[i-15]) + w[i-7] + s1(w[i-2])
Step 3) Given the starting state (8 32-bit words, labelled A through H totaling 256-bits), perform 64 iterations of the round calculations. Each round takes as input its respective W value from the W array and the state of the previous round (initial state for the first round). It gives as output a new state.
Step 4) Add the output state from the last round to the starting (initial) state. This is done per 32-bit words. So A from initial state is added to A from final state, and so forth.

Step 3 is the easiest to understand and implement, because its inputs and outputs are obvious and consistent. They work exactly how you expect. Take input, do binary and arithmetic logic, and register the output. String 64 of those together in a chain and go make some tea.

Step 2 can also be easy to implement. You take 512-bits of data as input, monkey around with it, and register 512-bits as output. String a bunch of those together and bake some cupcakes. This is implemented in the non-optimized variation of the FPGA code, as seen in src/sha256_transform.v. In fact the whole of that unoptimized variation was quite a bit easier to comprehend.(1)

However, the compilers don't like that way of handling the calculation of W. While you do indeed need to keep 16 words floating around at any given time, only 4 of them are used for each calculation in each instance. So it's a bit wasteful to hand all 16 words of W to the next module in the chain, when it only needs 4 of those words. makomk optimizes this by explicitly using shift registers to get each word of W to where it specifically belongs. This is why you see a bunch of shift registers will all sorts of different lengths. They are carrying the words of W to their various destinations, and how long the shifter needs to be often depends on if its one of the first links in the chain or not.

In theory, a smart compiler could figure this out on its own; realize that a chain of registers is being formed and turn that into a shifter. But in reality that does not seem to be the case, so makomk makes it explicit. It probably also helps the compiler to see optimizations of constant data (most of the input data is constant for the first sha_transform instance).

Quote
So you have a variable K - want to see how hard it is to search a document for references to the letter K?
In Vim, put your cursor on K and press *. It finds every instance of K for me, and nothing extraneous. ...  Okay, sorry, I don't mean to be a smartass. I do indeed see your point, and I appreciate the feedback very much. I had the same pains as you before I started using Vim.

So would wires be labelled sig_ and registers reg_?

Quote
For example, wtf does cur_w1 mean?
Current W array index i-1 ... err, I think. Or it means current W array i - 15.
No, I agree, those are confusing. I will take the time to clean it up and document.

Quote
or a _t1
T1 is one of the intermediary calculations in SHA-256. See the wiki pseudo code I linked.

Quote
But truthfully my understanding of the sha256 algorithm and it's pipelined version are probably a little bit lacking, and that is not helping to understand the code/flow.
I hope the quick overview I wrote above helps. Luckily SHA-256 is not too complex. I initially spent about a day or two mulling it all over in my head before I had a firm grasp over everything and how it gets implemented. If you want, check out the first commited version of sha256_transform.v: first commit. It's just two modules, and doesn't support anything fancy. Just fully unrolled and dead simple. It's probably a great place to start after you have a quick glance over the wiki pseudo code for SHA-256. From there you can look at the current src/sha256_transform.v I linked above, which includes the flexible loop unrolling parameter. And then look at the current makomk version of the code. That's exactly how the code has evolved, so it's a great way to take it step by step.

(1) Please note that the way this code handles LOOP_LOG2>0 is different from makomk's code. In this code, if LOOP_LOG2=2 then HASHERS[0] is responsible for rounds 0 through 3, as an example. i.e. the hashers feed back into themselves. In the makomk code you instead have the last hasher feed into the first. As far as I know this is necessary to get the code to fold correctly with his optimizations.

Advertised sites are not endorsed by the Bitcoin Forum. They may be unsafe, untrustworthy, or illegal in your jurisdiction. Advertise here.
1508306746
Hero Member

Offline

Posts: 1508306746

Ignore
 1508306746

1508306746
 Report to moderator
1508306746
Hero Member

Offline

Posts: 1508306746

Ignore
 1508306746

1508306746
 Report to moderator
1508306746
Hero Member

Offline

Posts: 1508306746

Ignore
 1508306746

1508306746
 Report to moderator
TheSeven
Hero Member

Offline

Activity: 504

FPGA Mining LLC

 July 27, 2011, 06:18:51 AM

For LOOP_LOG2>0 it is a bit more confusing. The hashers are still in a chain, but there are less of them (by some power of two), and the last one in the chain feeds the first one. That allows work to loop around the hasher chain a number of times to complete all 64 needed rounds. As an example, a unit of work will enter in at HASHERS[0], get worked on until HASHERS[31], and then get fed back into HASHERS[0] to get worked on another 32 times (for a total of 64 rounds). This feedback occurs on alternating clock cycles as determined by the cnt signal. When cnt==0 HASHERS[0] processes new work. On all other cnt HASHERS[0] processes old work from the last hasher in the chain.

This would (for LOOP_LOG2=1) mean, that it will need to be fed a burst of work during 32 clocks, and then nothing at all during the next 32 clocks, which seems to be unneccessarily complicated to me.

Mine works in a different way. It feeds back a stage's outputs to its own inputs on every second clock, making HASHERS[0] handle rounds 0 and 1, and HASHERS[31] handle rounds 62 and 63. This way it just seems to work at half the frequency externally, no need for bursts.

However, the compilers don't like that way of handling the calculation of W. While you do indeed need to keep 16 words floating around at any given time, only 4 of them are used for each calculation in each instance. So it's a bit wasteful to hand all 16 words of W to the next module in the chain, when it only needs 4 of those words. makomk optimizes this by explicitly using shift registers to get each word of W to where it specifically belongs. This is why you see a bunch of shift registers will all sorts of different lengths. They are carrying the words of W to their various destinations, and how long the shifter needs to be often depends on if its one of the first links in the chain or not.

Hm, this is weird. Maybe somehow related to verilog?
With my VHDL code, it tries to implement every single occurrence of two registers in a row as a shift register, leading to lots of unused flipflops and an unroutably high LUT usage. Limiting that to at least 4 registers in a row in the synthesis options improved things a lot.

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY
fpgaminer
Hero Member

Offline

Activity: 560

 July 27, 2011, 07:34:45 AM

Quote
This would (for LOOP_LOG2=1) mean, that it will need to be fed a burst of work during 32 clocks, and then nothing at all during the next 32 clocks, which seems to be unneccessarily complicated to me.
It's fed "new" work every other clock cycle, confusingly enough. HASHERS[0], for example, will take new work at the first clock, old work at the second, new work at the third, old work at the fourth, and so on. It's really not a terrible way to do it, except for the long feedback chain required between the last and the first hasher, which is fine if those two instances are placed next to one another. ... and the fact that it's confusing and makes my brain squeal.

It may even be better, to be honest, because you have feedback in only one location instead of at every hasher. If you make a U shape out of the hasher placement it puts the first and last hasher next to each other for fast feedback there, and the rest can happily route forward. K_next is the only signal that changes the behavior of the hashers at each cycle, and is likely implemented as a 2:1 MUX at LOOP_LOG2=1.

Quote
Mine works in a different way. It feeds back a stage's outputs to its own inputs on every second clock, making HASHERS[0] handle rounds 0 and 1, and HASHERS[31] handle rounds 62 and 63. This way it just seems to work at half the frequency externally, no need for bursts.
That is how the normal code works. This is only applicable to makomk's latest revisions, where I think he chose to do a long chain loop instead of tight feedback because of how the W shift registers are implemented at LOOP_LOG2>0. He'll have to chime in here to verify, as I haven't worked all the logic out for myself.

Quote
Hm, this is weird. Maybe somehow related to verilog?
With my VHDL code, it tries to implement every single occurrence of two registers in a row as a shift register, leading to lots of unused flipflops and an unroutably high LUT usage. Limiting that to at least 4 registers in a row in the synthesis options improved things a lot.
Quite possibly. I am mostly just happy to finally have a routable design running on my LX150 at 100MHz  At least on Altera, Quartus couldn't figure out how to shove things into the M9Ks with vanilla code. makomk's version with explicit shifters is what got Quartus to use the M9Ks and cram the design into 75K LEs.

Perhaps the way VHDL is synthesized allows the compiler to better realize chains of registers? It seems odd, but I wouldn't put anything past these compilers

makomk
Hero Member

Offline

Activity: 686

 July 27, 2011, 01:23:02 PM

edit: update
if I use this for the K and K_next assignment when LOOP == 1, I don't get the LUT messages anymore:
Quote
`ifdef USE_RAM_FOR_KS
if ( LOOP == 1) begin
assign K = Ks_mem[ i ];
assign K_next = Ks_mem[ i + 1 ];
end else begin
...
I think the problem is that K and K_next are not assigned in a clock state, thus they become asynchronous combinatorial logic - and XST can't map that to a ROM?  Or maybe it's the addition of using a multiplier output as an address selector?  Something in there XST wasn't liking for me.
I'm guessing the reason you don't get the LUT RAM messages anymore is because xst is now interpreting that as a single constant value rather than as a table lookup. (Since i is a genvar, both i and i+1 are constant at synthesis time, and xst already knows Ks_mem is read-only.)

I was able to achieve 100MHz and route it with the current design, slightly modified.  Changed the PLL to output clk0 so no clock division, and I changed the K/K_next assignments as to what I described in my previous post.  This routes it to a min clock period of 9.742ns for me.
Congratulations, nice one! Wonder what exactly ISE was doing before...

It also seems like the worst case critical path related to the 100MHz clock is between Hasher[8] and Hasher[13], looks like an output of an adder in Hasher[8], adding rx_state and k_next gets registered into Hasher[13]'s shift_w1 register.
That's interesting.

Edit: Just noticed there was another page of discussion. fpgaminer's explanation is indeed correct, including for the bits I modified. (I'm afraid I'm actually responsible for a lot of the confusing bits, including the cur_ prefixes which are there to distinguish between the values of wn in this round and the variable new_w15 which holds the value of w15 used by the next round.)

It might be more useful to think of the w computation as being
Code:
w[i+16] = w[i] + s0(w[i+1]) + w[i+9] + s1(w[i+14])
because that's effectively how it's being calculated. cur_w<n> actually means w[<n>+i] and next_w15 will hold the value of w[i+16] one clock cycle later.

This is only applicable to makomk's latest revisions, where I think he chose to do a long chain loop instead of tight feedback because of how the W shift registers are implemented at LOOP_LOG2>0. He'll have to chime in here to verify, as I haven't worked all the logic out for myself.
That's one reason for the change, but it's probably not the only one; all those muxes at every stage in the pipeline aren't exactly cheap from what I can tell. (Also: don't forget the extra added clock cycle of delay in the feedback path in order to make each piece of data arrive at the input with cnt one higher than on its previous pass through the pipeline.)

Quad XC6SLX150 Board: 860 MHash/s or so.
magik
Jr. Member

Offline

Activity: 44

 July 28, 2011, 04:22:08 AM

just got off a long day of work, brain hurts, so I havn't really gone over what you've written here so far, but skimming it definitely seems like it will help me understanding everything.  When I get some free time I'll try to analyze it and work it all out, maybe even draw a single cell diagram for anyone else trying to wrap their heads around it.

On another note - the 2engine design I let running finally finished, and failed.  It ran out of placement sites - said it was short like 1k FFs and 3k LUTs to implement the design - and this was before routing, so it may not even have been able to attain 100 MHz there.  I am running it again, but it may take 2-3 days again - but this time I've enabled some more aggressive optimizations in the Xilinx ISE.

We might have to settle for having 1 fully unrolled engine, and then 1 LOOP_LOG2=1 engine running and that should hopefully get to ~150MHz pretty easily.

Has anyone else gotten a 100MHz version working?  Should I compile a bit file for someone to test it?  I hate flying blind with a target device to test on.
makomk
Hero Member

Offline

Activity: 686

 July 28, 2011, 08:08:59 AM

On another note - the 2engine design I let running finally finished, and failed.  It ran out of placement sites - said it was short like 1k FFs and 3k LUTs to implement the design - and this was before routing, so it may not even have been able to attain 100 MHz there.  I am running it again, but it may take 2-3 days again - but this time I've enabled some more aggressive optimizations in the Xilinx ISE.
Ah. If you look at the list of LUTs it couldn't place, you'll almost certainly find that they're adders with carry chains: there aren't enough suitable sites with fast carry hardware to place all of the 32-bit-wide adders contiguously. I got the same error trying to fit a fully unrolled engine onto a XC6SLX75.

Quad XC6SLX150 Board: 860 MHash/s or so.
makomk
Hero Member

Offline

Activity: 686

 July 29, 2011, 01:32:19 PM

Finally got a set of modifications to the K_next lookup that ISE likes! It reported 100 MHz at LOOP_LOG2=1 on XC6SLX75 and I've sent a pull request to fpgaminer. Looks like it might be a bit hit and miss as to whether it meets timing though, since the next run failed with an actual period of 10.412ns rather than the required 10ns.

Edit: Might help if I actually linked to it. It's in the projects/LX150_makomk_Test dir of the xilinx-better-k-for-merge branch
Edit 2: Now it's consistently getting 10.412ns *sigh*.
Edit 3: Trivial fix for this - in addition to using that branch, find the line:
Code:
.rx_t1_part(feedback ? t1_part_fb : (rx_state[`IDX(7)] + cur_w0 + K))
Code:
.rx_t1_part(feedback ? t1_part_fb : (rx_state[`IDX(7)] + rx_input[31:0] + K))
You should now be able to achieve 100MHz (50MHash/sec) on the XC6SLX75, at least in theory. It's late and I need to sleep so this isn't committed anywhere yet.

Quad XC6SLX150 Board: 860 MHash/s or so.
fpgaminer
Hero Member

Offline

Activity: 560

 July 30, 2011, 12:42:51 PM

Quote
Trivial fix for this - in addition to using that branch, find the line:
Hmm ... that change doesn't seem like it should have any impact; ISE should be able to figure out that cur_w0 is rx_input[31:0] there unless I missed something. Was the timing analyzer pointing that out as being the trouble maker?

makomk
Hero Member

Offline

Activity: 686

 July 30, 2011, 04:00:48 PM

Hmm ... that change doesn't seem like it should have any impact; ISE should be able to figure out that cur_w0 is rx_input[31:0] there unless I missed something.
It might for LOOP_LOG2=0, but it doesn't appear to for LOOP_LOG2=1. I've no idea why it doesn't; perhaps it doesn't know to make that particular optimization.

Was the timing analyzer pointing that out as being the trouble maker?
The timing analyzer was pointing to a long chain of logic, but it did indeed include an unnecessary mux between w0_fb and rx_input[31:0] feeding into the computation of the initial value of t1_part. Replacing that with a direct reference to rx_input[31:0] resulted in it reaching timing closure at 100MHz. (Again: this is all with LOOP_LOG2=1)

Quad XC6SLX150 Board: 860 MHash/s or so.
fpgaminer
Hero Member

Offline

Activity: 560

 August 02, 2011, 12:10:36 PM

August 8th, 2011 - Spartan-6 Code Added and Working, Altera Mining Script Updated
Thanks to the efforts of makomk, the Spartan-6 series of chips are now supported, and achieve the highest performance per \$ of any chip. Code has been verified working on my LX150 development kit, and can achieve up to 100MH/s of performance.

The Altera Tcl Mining Script has just received a massive update. No more need to edit mine.tcl to hack in your FPGA's hardware and device names; mine.tcl will automatically detect mining FPGAs connected to the system. Pool information has been moved to a config.tcl file for easy editing. No more dependency on TclCurl, so the script should be Linux friendly now. And best of all, the console output has been cleaned up to look like poclbm.

Here is a full list of updates:

• No more dependency on TclCurl, so less hassle on Linux.
• Auto-detection of mining FPGAs. No more need to edit mine.tcl to put in your FPGA's information. Just run and the script should find the FPGA automatically.
• Pool url, worker username, and worker password split into config.tcl. See config.example.tcl.
• All debug messages removed from console output.
• Console output is timestamped.
• Miner information reported to console similar to poclbm's console output.
• Console reports FPGA actual hashing rate, estimated hashing rate, and accepted/rejected shares.
• Better error handling.
• Tracking accepted/rejected shares.
• Generic API introduced for communication with the FPGA, allowing better future support for different chips/firmwares.
• General code maintainability improvements.

Screenshot of the new console output:

Screenshot of the script automatically detecting my FPGA

There are still two missing major features. 1) It doesn't find and utilizing multiple connected FPGAs. 2) Long Polling
The firmware code also needs to be updated to support a work queue and results queue. All of these are planned future updates. I will also (hopefully) merge the Xilinx and Altera mining scripts together, for ease of use and mixing both kinds of chips on the same system

TheSeven
Hero Member

Offline

Activity: 504

FPGA Mining LLC

 August 03, 2011, 08:43:35 PM

Do you want to get rid of the virtual wire / chipscope? If yes, we should probably decide on a common simple serial protocol for all these FPGA designs, so that one software will fit them all. I'll happily implement that in my python miner, so you don't need to care about the software side of things.

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY
pusle
Member

Offline

Activity: 89

 August 04, 2011, 04:32:15 AM

I second that.  Uarts are easily implemented in FPGA's and the obvious choice for a "universal" interface. (via usb mostly)

May I suggest that the protocol handles multiple devices connected in parallel by some form of adressing.
This way lots of fpga's can be connected together easily on a common board and served across a single com port. (similar to RS-485)

I guess this thread is as good place as any to discuss the protocol?

fpgaminer
Hero Member

Offline

Activity: 560

 August 04, 2011, 08:36:59 AM

Quote
If yes, we should probably decide on a common simple serial protocol for all these FPGA designs, so that one software will fit them all. I'll happily implement that in my python miner, so you don't need to care about the software side of things.
I completely agree, and Python is a far better choice for controller software than Tcl

I took the time to write out some preliminary specifications for the internal hardware interfaces:
https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/wiki/Specification:-Components,-Interfaces,-and-Protocols-%5BWIP%5D

TL;DR: I abstracted away all of the hardware and implementation specifics and shielded them with a single, memory mapped component called the Comm. Being memory mapped, it's a trivial interface that protocols like I2C and SPI can wrap easily.

Relevant to the topic at hand, the specifics of this wrapping by SPI, I2C, UART, JTAG, or what have you will be documented here:
https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/wiki/Specification:-PHY-Interfaces

And yes, I've abused the term PHY in that document and the previous specification. Because I'm a horrible person and I want to watch language burn for my own personal pleasure

Currently in that PHY specification I've only outlined the requirements. TL;DR: It must support multiple devices (addressing a single device specifically), allow reading registers, and writing registers. That's it.

I haven't laid down any ground work for any specific protocols yet, but I listed out what the most immediate ones probably are: UART, SPI, and JTAG. Feel free to begin specifying those in this thread.

And as always, feel free to critique my work, call me stupid, and redo the whole thing

fpgaminer
Hero Member

Offline

Activity: 560

 August 04, 2011, 10:13:50 AM

Regarding Performance Optimizing Spartan-6 LX150:
DANGER: Long detailed post coming... sorry, I hope the information is useful though

I'm working with my LX150_makomk_speed_Test project, where I'm trying to nail down the performance bottlenecks and remove them. I'm learning up FPGA Editor so I can better visualize what the router is doing, and I've read through some of the Spartan-6 UGs to get a better understanding of the architecture.

First off, I will say I am quite impressed by Xilinx's work and foresight on the logic of the S6's slices. They can perform a 3 component, 32-bit addition in 8 chained slices, with 4-bits being computed per slice. That blew my mind when I saw it in the FPGA Editor. This is great for our mining algorithm, and you can see why in this critical path analysis:

Code:
W:           16 slices + 0 slices = 16
tx_t1_part:  8  slices + 0 slices = 8

t1:          8  slices + 0 slices
tx_state[0]: 8  slices + 8 slices = 16
tx_state[4]: 8  slices + 8 slices = 16
The worst critical paths are only 16 slices long, with a single break in the carry chain (AFAIK). W is a 4-way, performing a 3-way of the first 3, and a 2-way of the result and the remaining component. tx_state[4] is a 2-way with t1 and rx_state[3].

I haven't fully analyzed the router's behavior on the 2-way's yet, but it appears to include work from other operations ... somehow. Not sure yet.

So, that's the good news. The bad news is, of course, only half of the slices are useful. There are two slices in a CLB. One slice always has fast-carry logic and chains to the slice directly above it (in the CLB above it). The other slice is a lowest form of life slice. It's still a powerful slice, with 4 6-LUTs (or 8 5-LUTs, or combinations thereof), and 8 flip flops, but the mining algorithm has rare use for it.

The next bad news is, only half of the "good" slices can be used as RAM or shift registers. That's not a terrible thing since most will be consumed as adders anyway.

And that's about all I could find that's particularly good or bad with the S6 slices. Since the good slices are all in columns, and spaced evenly, the impact of the useless slices should actually be far less severe than I thought.

For the S6's routing architecture, the quick overview basically said routing costs roughly Manhattan Distance between CLBs. I haven't dug into the details more than that at this point.

With that knowledge in hand, and some beginner's experience with FPGA Editor, I dived in and found what appears to be the largest bottleneck in the current code:

Code:
Slack (setup path):     0.264ns (requirement - (data path - clock path skew + uncertainty))
Source:               uut2/HASHERS[41].shift_w1/Mram_m (RAM)
Destination:          uut2/HASHERS[41].upd_w/tx_w15_30 (FF)
Requirement:          12.500ns
Data Path Delay:      11.716ns (Levels of Logic = 6)
Clock Path Skew:      -0.260ns (0.780 - 1.040)
Source Clock:         hash_clk rising at 0.000ns
Destination Clock:    hash_clk rising at 12.500ns
Clock Uncertainty:    0.260ns

Clock Uncertainty:          0.260ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
Total System Jitter (TSJ):  0.070ns
Total Input Jitter (TIJ):   0.000ns
Discrete Jitter (DJ):       0.450ns
Phase Error (PE):           0.000ns

Maximum Data Path at Slow Process Corner: uut2/HASHERS[41].shift_w1/Mram_m to uut2/HASHERS[41].upd_w/tx_w15_30
Location             Delay type         Delay(ns)  Physical Resource
Logical Resource(s)
-------------------------------------------------  -------------------
RAMB16_X2Y46.DOA2    Trcko_DOA             1.850   uut2/HASHERS[41].shift_w1/Mram_m
uut2/HASHERS[41].shift_w1/Mram_m
SLICE_X60Y126.A2     net (fanout=4)        5.845   uut2/HASHERS[41].cur_w1<2>
SLICE_X60Y127.BMUX   Tcinb                 0.292   uut2/HASHERS[41].shift_w0/r<27>
SLICE_X78Y122.BMUX   Tilo                  0.251   uut2/HASHERS[41].upd_w/tx_w15<23>
SLICE_X78Y122.COUT   Topcyc                0.295   uut2/HASHERS[41].upd_w/tx_w15<23>
SLICE_X78Y123.COUT   Tbyp                  0.076   uut2/HASHERS[41].upd_w/tx_w15<27>
SLICE_X78Y124.CLK    Tcinck                0.341   uut2/HASHERS[41].upd_w/tx_w15<31>
uut2/HASHERS[41].upd_w/tx_w15_30
-------------------------------------------------  ---------------------------
Total                                     11.716ns (3.484ns logic, 8.232ns route)
(29.7% logic, 70.3% route)

It's being forced to route from a RAMB16BWER to a CLB that's right smack dab in the middle of a group of columns, furthest possible position from possible RAM locations. Here, check this image out, it will make you go insane, so don't stare too long:
http://i.imgur.com/gBv5R.png (RAM is on the left).
No, seriously, don't stare at it. The router will drive insanity into the depths of your soon rotting brain fleshes. After exploding the Universe, of course.

Oh, you looked at it anyway and are wondering about that little path heading downward? Yeah, it keeps going ... and going ... (into my damned soul).

And as you can read from the timing report above, routing accounts for *drum roll* 70.3%! Yay! That's 8ns of routing, and only 3.4ns of logic! Imagine if we got rid of all the routing...

I see four solutions at the moment, and will investigate as time allows:

1) Get rid of the RAMB16BWER to some extent.
2) Add an extra register to the output of shifter_32b when inferring RAM logic. Flip-flops should route close to the logic and mask RAM routing delay.
3) Add two, duplicate registers to the output of shifter_32b when inferring RAM logic.
4) Ditch the RAM infer completely and try to coax ISE into using all those flip-flops in the useless slices (which are peppered throughout the routed design at the moment).

I will try 3 first, and hope ISE does the intelligent thing. My hope is that flip-flops in the useless slices will get utilized, since they're mingled in with the useful logic and so should provide somewhat fast local routing.

The interesting this is that we've got lots of RAM to play with. The design is using ~30% of the 16BWERs, and none of the 8BWERs. It seems like a good idea to try to use them and bring slice consumption down if possible, but only if their awkward placement can be solved appropriately.

makomk
Hero Member

Offline

Activity: 686

 August 04, 2011, 11:31:27 AM

First off, I will say I am quite impressed by Xilinx's work and foresight on the logic of the S6's slices. They can perform a 3 component, 32-bit addition in 8 chained slices, with 4-bits being computed per slice. That blew my mind when I saw it in the FPGA Editor. This is great for our mining algorithm, and you can see why in this critical path analysis:
Yeah, neat isn't it? That's a big part of the reason I was initially confident about Spartan 6 performance, at least until I found out about the catch...

So, that's the good news. The bad news is, of course, only half of the slices are useful. There are two slices in a CLB. One slice always has fast-carry logic and chains to the slice directly above it (in the CLB above it). The other slice is a lowest form of life slice. It's still a powerful slice, with 4 6-LUTs (or 8 5-LUTs, or combinations thereof), and 8 flip flops, but the mining algorithm has rare use for it.

It's being forced to route from a RAMB16BWER to a CLB that's right smack dab in the middle of a group of columns, furthest possible position from possible RAM locations. Here, check this image out, it will make you go insane, so don't stare too long:
http://i.imgur.com/gBv5R.png (RAM is on the left).
No, seriously, don't stare at it. The router will drive insanity into the depths of your soon rotting brain fleshes. After exploding the Universe, of course.
Whoops, you're right. It appears to be doing the something similar on the SLX75 too; I should've spotted that sooner. (Judging from the timing report at least; I haven't looked in FPGA Editor because the Linux version is a pain in the ass to actually run on my distro. It uses some grotty ancient Motif-based wrapper library that emulates some prehistoric version of the Windows API and requires deep magic to get it to launch.)

The interesting this is that we've got lots of RAM to play with. The design is using ~30% of the 16BWERs, and none of the 8BWERs. It seems like a good idea to try to use them and bring slice consumption down if possible, but only if their awkward placement can be solved appropriately.

The 8BWERs suffer from some slightly annoying errata. Also I think they may actually be shared with the 16BWERs and so are not actually all available, but don't quote me on that. Not sure to what use they can be put either.

Quad XC6SLX150 Board: 860 MHash/s or so.
fpgaminer
Hero Member

Offline

Activity: 560

 August 04, 2011, 12:07:12 PM

Quote
Not sure to what use they can be put either.
I'm guessing you could pair them up to get the equivalent of a 16BWER, but I'm not sure.

I just tried adding an extra register to the shifter's inferred RAM. After compilation it failed timing (80MHz) and ... the register was gone. I'm guessing it optimized the register away somehow, or balanced it. Either way, it ended up having a negative impact. I'm running another compile with USE_XILINX_BRAM_FOR_W off to see how that works. Perhaps we need to find a way for ISE not to optimize the shifter so much when USE_XILINX_BRAM_FOR_W is being used?

makomk
Hero Member

Offline

Activity: 686

 August 04, 2011, 12:21:33 PM

I just tried adding an extra register to the shifter's inferred RAM. After compilation it failed timing (80MHz) and ... the register was gone. I'm guessing it optimized the register away somehow, or balanced it. Either way, it ended up having a negative impact. I'm running another compile with USE_XILINX_BRAM_FOR_W off to see how that works. Perhaps we need to find a way for ISE not to optimize the shifter so much when USE_XILINX_BRAM_FOR_W is being used?
If you haven't tweaked ISE's setting for the minimum size of shift register to infer yet, you probably need to do that. (By default two or more registers in a row are automatically replaced by a shift register.)

Quad XC6SLX150 Board: 860 MHash/s or so.
pusle
Member

Offline

Activity: 89

 August 04, 2011, 01:07:21 PM

The data would most likely end up going through a com port eventually.  Might it not be better to have an 8bit data width in the registers etc ?
Then you don't have to worry about dis/assembling the 32bit words either and the usual what are LSB and MSB, endian ARGH!

These huge 512/256 bit internal busses between slave, master and comm will eat quite a bit of routing resources.
I would suggest having an internal serial format using the system clock + 1 bit bus and a clock enable/data valid ctrl signal.
Sender has a tiny state machine/counter, while the receiver module is only a shift register with a clock enable.

To make up for the tiny tiny waste in waiting for new data your could have double buffers in the slaves.

makomk
Hero Member

Offline

Activity: 686

 August 04, 2011, 08:43:29 PM

If you haven't tweaked ISE's setting for the minimum size of shift register to infer yet, you probably need to do that. (By default two or more registers in a row are automatically replaced by a shift register.)

Hmmm. Actually, that doesn't seem to quite solve the problem. Try something like this patch. Causes slightly worse timing elsewhere on XC6SLX75 so it's actually a net loss there, but you may be able to avoid or work around this. (Edit: Also, KEEP is probably not the right constraint to use here; it will cause unneeded registers and logic and RAM we want to elimiate to be kept.)

Quad XC6SLX150 Board: 860 MHash/s or so.
makomk
Hero Member

Offline

Activity: 686

 August 10, 2011, 06:10:10 PM

Finally got around to coding some maximum clock speed improvements for users of smaller Cyclone III and IV devices - now available from my new partial-unroll-speed branch. Expected minimum device size and speed is roughly as follows:

LOOP_LOG2=0: minimum device size ~75k LEs, maximum theoretical speed 110 MHash/s @ 110 MHz. Use the DE115_makomk_mod directory in fpgaminer's main git repository.
LOOP_LOG2=1: minimum device size ~50k LEs, max theoretical speed 50 MHash/s @ 100 MHz. Use DE2_115_Unoptimized_Pipelined directory in the partial-unroll-speed branch of my github repository.
LOOP_LOG2=2: minimum device size ~30k LEs, max theoretical speed 25 MHash/s @ 100 MHz. Use same partial-unroll-speed code, but edit the line after "`ifdef USE_EXPLICIT_ALTSHIFT_FOR_W" in sha256_transform.v to read "if(LENGTH >= 5) begin" rather than 4.
LOOP_LOG2=3: minimum device size ~20k LEs, max theoretical speed 13.75 MHash/s @ 110 MHz. Use partial-unroll-speed again, no modifications required.
LOOP_LOG2>3: not supported by either of these two code code versions, stick to the original partially-unrolled code in the DE2_115_Unoptimized_Pipelined directory of fpgaminer's git repository? (Bear in mind it may not be very area-efficient.)

It may be possible to get some of these up to 110 MHz or fit them onto slightly smaller devices (actual LE usage is currently about 74k, 46k, 26k?, and 16k respectively).

Also: remember that some or all of these may require extra cooling in order to operate safely.

Quad XC6SLX150 Board: 860 MHash/s or so.