Author Topic: Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013)  (Read 432863 times)
makomk
Hero Member
*****
Offline

Activity: 686
Merit: 564


View Profile
July 21, 2011, 12:20:01 PM
Last edit: July 21, 2011, 02:09:45 PM by makomk
 #421

Quote
I tried compiling a version of my code, LOOP_LOG2=1, and two extra pipeline registers with Register Balancing enabled. It couldn't even get past Mapping, with an area constraint error  Tongue
The LOOP_LOG2>0 code isn't terribly efficient. There are some changes to make it a bit better in the partial-unroll-opt branch on my github repo; I'll see if I can get them merged into DE2_115_makomk_mod and send you a pull request at some point.

Just put together an untested port of them to Xilinx and got an apparently successful synthesis run for the XC6SLX75 at LOOP_LOG2=1 and 50 MHz. This is with some experimental block-RAM-as-shift-register code that I'm not sure will work, though. Will upload the code in a bit.

Edit: Also, the DCM_SP doesn't seem to work quite the way I might expect. Curious.
Edit 2: Aha - with the way this code uses it, the CLKFX_DIVIDE and CLKFX_MULTIPLY settings are effectively irrelevant, as is CLKDV_DIVIDE.
Edit 3: Can't quite get it to hit 100 MHz with LOOP_LOG2=1 on the XC6SLX75 (actual period = 11.045ns). Looks like I'd need to port over the patch for the extra pipeline stage to compute the initial value of H+K+W[0]. Untested 50 MHz code dump here

Quad XC6SLX150 Board: 860 MHash/s or so.
SIGS ABOUT BUTTERFLY LABS ARE PAID ADS
fpgaminer (OP)
Hero Member
*****
Offline

Activity: 560
Merit: 517



View Profile WWW
July 22, 2011, 12:49:32 AM
 #422

Quote
The LOOP_LOG2>0 code isn't terribly efficient.
It's grotesquely inefficient compared to LOOP_LOG2=0  Tongue For the vanilla code (DE2_115_Unoptimized), LOOP_LOG2=1 is almost as big as LOOP_LOG2=0 on Altera. It's terrifying.

Quote
Just put together an untested port of them to Xilinx and got an apparently successful synthesis run for the XC6SLX75 at LOOP_LOG2=1 and 50 MHz. This is with some experimental block-RAM-as-shift-register code that I'm not sure will work, though. Will upload the code in a bit.
Thank you for sharing. I'll muck with the code a bit and try it out on my LX150. I've got my fingers crossed ...

Quote
Edit: Also, the DCM_SP doesn't seem to work quite the way I might expect. Curious.
Edit 2: Aha - with the way this code uses it, the CLKFX_DIVIDE and CLKFX_MULTIPLY settings are effectively irrelevant, as is CLKDV_DIVIDE.
The DCM confuses the heck out of me. I really should read its datasheet and get my head straight. Anyway, the code I use is on the public repo, so you can take a look at the DCM that coregen made for my project: https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/blob/master/projects/LX150_Test/hdl/main_pll.v
It uses the CLKDV_DIVIDE parameter to take an input 100MHz clock and spit out a 50MHz clock.

Quote
Edit 3: Can't quite get it to hit 100 MHz with LOOP_LOG2=1 on the XC6SLX75 (actual period = 11.045ns).
If it fully unrolls on the LX150 and gets close to 100MHz I will be very happy. That alone will yield higher MHash/s/$ than the current Altera solution, and we can build up from there. Those DSP slices are sitting all lonely and unused  Smiley

makomk
Hero Member
*****
Offline

Activity: 686
Merit: 564


View Profile
July 22, 2011, 01:41:58 AM
 #423

Quote
It's grotesquely inefficient compared to LOOP_LOG2=0  Tongue For the vanilla code (DE2_115_Unoptimized), LOOP_LOG2=1 is almost as big as LOOP_LOG2=0 on Altera. It's terrifying.
Bigger, from what I remember. These changes get it down to about 51k LEs for LOOP_LOG2=1 on Cyclone IV, which isn't great but...

Quote
The DCM confuses the heck out of me. I really should read its datasheet and get my head straight. Anyway, the code I use is on the public repo, so you can take a look at the DCM that coregen made for my project: https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/blob/master/projects/LX150_Test/hdl/main_pll.v
It uses the CLKDV_DIVIDE parameter to take an input 100MHz clock and spit out a 50MHz clock.
Yeah, that makes sense. Having read the datasheet, the CLKn outputs are the same frequency as the input, CLK2X is 2x (obviously), CLKDV is input frequency/CLKDV_DIVIDE and CLKFX is input frequency*CLKFX_MULTIPLY/CLKFX_DIVIDE.
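To make that concrete, here's an untested sketch of the divide-by-two hookup (illustrative values on the standard DCM_SP primitive - not the coregen file from the repo):

Code:
// 100 MHz in; CLKDV = 100/2 = 50 MHz. CLKFX would be 100*2/4 = 50 MHz too,
// but it's left unconnected here - which is why its settings don't matter.
module clk_div_sketch (input clk_100, output clk_50);
   wire clk0_raw, clk0_buf, clkdv_raw;

   DCM_SP #(
      .CLKIN_PERIOD(10.0),  // 100 MHz input
      .CLKDV_DIVIDE(2.0),   // CLKDV = CLKIN / 2
      .CLKFX_MULTIPLY(2),   // only affects CLKFX/CLKFX180
      .CLKFX_DIVIDE(4),
      .CLK_FEEDBACK("1X")
   ) dcm0 (
      .CLKIN(clk_100),
      .CLKFB(clk0_buf),     // CLK0 must be fed back through a BUFG
      .RST(1'b0),
      .PSEN(1'b0), .PSCLK(1'b0), .PSINCDEC(1'b0), .DSSEN(1'b0),
      .CLK0(clk0_raw),
      .CLKDV(clkdv_raw)
   );

   BUFG fb_buf (.I(clk0_raw), .O(clk0_buf));
   BUFG dv_buf (.I(clkdv_raw), .O(clk_50));
endmodule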

Quote
If it fully unrolls on the LX150 and gets close to 100MHz I will be very happy. That alone will yield higher MHash/s/$ than the current Altera solution, and we can build up from there. Those DSP slices are sitting all lonely and unused  Smiley
I'm not even touching the DSP slices this time around, at least not initially; they appear to do not-so-great things to the routing and placement...

Quad XC6SLX150 Board: 860 MHash/s or so.
SIGS ABOUT BUTTERFLY LABS ARE PAID ADS
fpgaminer (OP)
Hero Member
*****
Offline

Activity: 560
Merit: 517



View Profile WWW
July 22, 2011, 06:22:46 AM
 #424

I got your code rolled into an LX150T project with ChipScope (JTAG) for the "virtual wires." It compiles up just fine at 50MHz and LOOP_LOG2=1 Smiley However, it does not deliver correct results when run on the live chip. The results are consistent, and the chip reads 54C at the surface, so I don't think the timing is wrong, nor is it overheating. I will run the code through ModelSim and see if I can track down the bug.

https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/LX150_makomk_Test

Side Note: I tried out ISE's power analysis tool. I've never used it before, and I don't have a post P&R simulation vector to give it, so I just set some over-estimated toggling values. 120 FF toggle, 100% BRAM usage, I think. It reported something like 1.4W. Seems a bit low to me ... any idea if that seems reasonable to you?

Quote
Bigger, from what I remember. These changes get it down to about 51k LEs for LOOP_LOG2=1 on Cyclone IV, which isn't great but...
That's great progress in my book. Well done, makomk! The rolled up designs are important as well, because they can help fill out large chips.

Quote
they appear to do not-so-great things to the routing and placement...
ಠ_ಠ

fpgaminer (OP)
Hero Member
*****
Offline

Activity: 560
Merit: 517



View Profile WWW
July 22, 2011, 09:45:16 AM
 #425

phew, finally tracked down the bug. The K and K_next wires in sha256_transform.HASHERS were not getting the right values. K was easy:

Code:
assign K = Ks_mem[(NUM_ROUNDS/LOOP)*cnt+i];

K_next is a little bit more complicated, because it has to use cnt differently for each HASHERS. For example, if LOOP_LOG2=1, then K_next in HASHERS[1] needs to alternate between Ks_mem[34] and Ks_mem[2], when cnt=0 and cnt=1 respectively. In HASHERS[2] K_next alternates between Ks_mem[3] and Ks_mem[35] respectively.

It's a little weird, but it makes sense. When LOOP_LOG2=1, HASHERS[0] alternates between doing fresh work (cnt=0, Round 0) and doing old work (cnt=1, Round 32): new, old, new, old, etc. Since HASHERS[1] is directly connected to HASHERS[0] it alternates as well, but it will alternate in the opposite fashion. It gets old work, and then new work, old, new, old, new, etc.

I threw a quick hack into my code for K_next (it's on the public repo), that only works for LOOP_LOG2=1:

Code:
if (i & 1)
   assign K_next = Ks_mem[(NUM_ROUNDS/LOOP)*!cnt[0]+i+1];
else
   assign K_next = Ks_mem[(NUM_ROUNDS/LOOP)*cnt+i+1];

Off the top of my head, I think this will work in the general case:
Code:
assign K_next = Ks_mem[(NUM_ROUNDS/LOOP)*((cnt+i) & (LOOP-1)) +i+1];
That basically says, adjust cnt by our position in the HASHERS chain. I can try it out later and check.
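Here's a quick throwaway sim snippet (not in the repo, just to check the indices) that prints the Ks_mem index each hasher would use; for LOOP=2 it should print 34 then 2 for HASHERS[1], and 3 then 35 for HASHERS[2], matching the numbers above:

Code:
module k_next_check;
   parameter NUM_ROUNDS = 64;
   parameter LOOP = 2;  // LOOP_LOG2 = 1
   integer i, cnt;
   initial
      for (i = 0; i < 4; i = i + 1)
         for (cnt = 0; cnt < LOOP; cnt = cnt + 1)
            $display("HASHERS[%0d] cnt=%0d -> Ks_mem[%0d]", i, cnt,
                     (NUM_ROUNDS/LOOP)*((cnt+i) & (LOOP-1)) + i + 1);
endmodule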

With that fix, the code works correctly in ModelSim, and it works on live hardware  Cheesy So my LX150T dev board finally gets 25MH/s of performance. Progress! I may run a compile of LOOP_LOG2=0 overnight and see if that finishes.

I'm finally making pleasant progress with the LX150 because of your hard work, makomk, so thank you.

makomk
Hero Member
*****
Offline

Activity: 686
Merit: 564


View Profile
July 22, 2011, 10:34:46 AM
Last edit: July 22, 2011, 11:12:36 AM by makomk
 #426

Quote
phew, finally tracked down the bug. The K and K_next wires in sha256_transform.HASHERS were not getting the right values.
Whoops, you are indeed right. I changed the non-USE_RAM_FOR_KS case but forgot to change or test the USE_RAM_FOR_KS one. Sorry!

Quote
K_next is a little bit more complicated, because it has to use cnt differently for each HASHERS. For example, if LOOP_LOG2=1, then K_next in HASHERS[1] needs to alternate between Ks_mem[34] and Ks_mem[2], when cnt=0 and cnt=1 respectively. In HASHERS[2] K_next alternates between Ks_mem[3] and Ks_mem[35] respectively.

It's a little weird, but it makes sense. When LOOP_LOG2=1, HASHERS[0] alternates between doing fresh work (cnt=0, Round 0) and doing old work (cnt=1, Round 32): new, old, new, old, etc. Since HASHERS[1] is directly connected to HASHERS[0] it alternates as well, but it will alternate in the opposite fashion. It gets old work, and then new work, old, new, old, new, etc.
Yep, that's entirely correct. I remember it wasn't much fun to figure this out; took me several hours to get right myself. (In fact, the whole code's a tad hairy.)

Quote
Off the top of my head, I think this will work in the general case:
Code:
assign K_next = Ks_mem[(NUM_ROUNDS/LOOP)*((cnt+i) & (LOOP-1)) +i+1];
That basically says, adjust cnt by our position in the HASHERS chain. I can try it out later and check.
That corresponds to what I was doing for the tested non-USE_RAM_FOR_KS case, so it should work. (Of course, the whole thing goes pear-shaped anyway if LOOP_LOG2 > 3 for reasons I haven't pinned down.)

Quote
With that fix, the code works correctly in ModelSim, and it works on live hardware  Cheesy So my LX150T dev board finally gets 25MH/s of performance. Progress! I may run a compile of LOOP_LOG2=0 overnight and see if that finishes.

I'm finally making pleasant progress with the LX150 because of your hard work, makomk, so thank you.
Yay - good news at last! Sorry again about that bug.

Edit: Fix tested in Modelsim at all working LOOP_LOG2 values (0, 1, 2 and 3) and pushed to partial-unroll-opt branch.

Quad XC6SLX150 Board: 860 MHash/s or so.
SIGS ABOUT BUTTERFLY LABS ARE PAID ADS
newMeat1
Full Member
***
Offline

Activity: 210
Merit: 100



View Profile
July 22, 2011, 08:12:28 PM
 #427

Great work guys! Thanks

fpgaminer (OP)
Hero Member
*****
Offline

Activity: 560
Merit: 517



View Profile WWW
July 23, 2011, 02:03:31 AM
 #428

Quote
Yep, that's entirely correct. I remember it wasn't much fun to figure this out; took me several hours to get right myself.
No kidding. It broke my brain for a while, until I realized it was just a delay chain, so you could add to cnt to get what cnt "looks like" at each stage in the chain.

Quote
Edit: Fix tested in Modelsim at all working LOOP_LOG2 values (0, 1, 2 and 3) and pushed to partial-unroll-opt branch.
Wonderful, thank you for checking!

Quote
Sorry again about that bug.
No worries. Your work is greatly appreciated, and I'm very excited to get my LX150 dev kit mining Smiley The guys over at the Modular FPGA hardware design thread will also be quite happy, since their design is based around the LX150.

I ran a LOOP_LOG2=0 compile overnight. Turns out, the compiles actually take very little time; under an hour. And yes, it completes just fine at 50MHz  Grin However, I've made silly mistake after silly mistake in the code, resulting in countless re-compiles. I'm hoping the compile I have going right now is the last one, and I can finally get correct results from the live hardware. I will report back with success once I've got it.

Looks like device utilization is about 50%, which is good, and XPower estimates 2.2W of consumption (FF toggle at 200, BRAM at 100%). I measure 50C on the chip's surface, ~38C with a small fan.

fpgaminer (OP)
Hero Member
*****
Offline Offline

Activity: 560
Merit: 517



View Profile WWW
July 23, 2011, 05:22:34 AM
 #429

I have now confirmed that with LOOP_LOG2=0, at 50MHz, the design works on live hardware and returns correct results. That means the Spartan-6 LX150 is now confirmed to perform at 50MHash/s (fully unrolled, the pipeline retires one hash per clock cycle, so MHash/s matches the clock in MHz).

Public repo has been updated with the code I just compiled and tested.

I want to write a mining script for it, and test it on a real pool. From there I'll ramp up the clock to see how close it will get to 100MHz Smiley

Quote
Edit 3: Can't quite get it to hit 100 MHz with LOOP_LOG2=1 on the XC6SLX75 (actual period = 11.045ns). Looks like I'd need to port over the patch for the extra pipeline stage to compute the initial value of H+K+W[0].
On my last run, I think ISE reported an actual period of 15ns. I'm still getting used to the timing report in ISE, so I could be wrong. Regardless, that's with it targeting 50MHz so I'm sure it will give better results with tighter constraints. I will certainly try to patch it for the initial t1_partial; that's bound to be helpful.

magik
Newbie
*
Offline

Activity: 44
Merit: 0


View Profile
July 23, 2011, 04:46:58 PM
Last edit: July 23, 2011, 06:40:05 PM by magik
 #430

ooh interesting stuff going on here for Spartan devices eh?  I need to check some of this out in my compiler as well.

The latest confirmed 50MHash/s on the LX150 - which codeset is that? The LX150_makomk directory?

And is that an LX150 or an LX150T?

Also, I see a testbench in there - do you maybe have a timing diagram of the expected/correct outcome for those inputs?

I'm not too familiar with ISE 13 myself, or Verilog for that matter - I use mostly VHDL - but it also looks like you left a ChipScope core in the project file on GitHub.

Also, it looks like the UCF is set up to receive a 100MHz clock, and I don't see any clock dividers in the code?

edit:
hrm... so it seems you are using ChipScope to communicate with the chip? Interesting, I haven't seen that before - do you guys discuss that somewhere in this thread? What software are you using to talk through the ChipScope objects?
fpgaminer (OP)
Hero Member
*****
Offline Offline

Activity: 560
Merit: 517



View Profile WWW
July 23, 2011, 11:28:33 PM
 #431

Quote
The latest confirmed 50MHash/s on the lx150 - which codeset is that? the LX150_makomk directory?
Yup!
https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/LX150_makomk_Test

Quote
and thats a lx150 or lx150t?
I have the Spartan 6 LX150T Development Kit, so it's an LX150T chip. However, the T makes no difference to the performance of the algorithm; it merely indicates that the chip has transceivers on it, which are irrelevant to a mining application. The mining board being developed in this thread will use the Spartan 6 LX150-3N variant, which has no transceivers, a -3 speed grade (fastest), and N for no memory controller.

Quote
also, I see a testbench in there - do you have maybe a timing diagram of expected/correct outcome for those inputs?
The testbench is a bit primitive at the moment. There are no test waveforms, and it isn't fully automated in that it won't just tell you if the design passed or failed (nor why it failed). It sets up the fpgaminer_top module as the Unit Under Test, manually setting up its internal registers for the test data documented in this file. When I do a test to verify the design I load the testbench up in ModelSim, put the top-level signals of the uut on the wave viewer, and tell it to run the simulation for about 8us. After that I can check the golden_nonce register to see if it matches the correct value of 0x0e33337a.

If it doesn't match then I debug manually. Ideally a robust testbench would check every stage of the SHA-256 calculations automatically and report the failures, but it's a bit non-trivial to implement because of the parameterized pipelining and the countless variations on the code by this point.
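A tiny self-checking block like this (hypothetical sketch, not what's in the repo) would at least automate that final golden_nonce comparison:

Code:
// Assumes `timescale 1ns / 1ps, so #8000 is ~8us of simulation time,
// and that the testbench instantiates fpgaminer_top as "uut".
initial begin
   #8000;
   if (uut.golden_nonce == 32'h0e33337a)
      $display("PASS: golden_nonce = %h", uut.golden_nonce);
   else
      $display("FAIL: golden_nonce = %h (expected 0e33337a)", uut.golden_nonce);
   $finish;
end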

Quote
also looks like the ucf is set up to receive a 100MHz clock and I don't see any clock dividers in the code?
See main_pll.v which is instantiated in fpgaminer_top.

Quote
hrm... so it seems you are using chipscope to communicate with the chip? interesting, I havn't seen that before - you guys discuss that somewhere in this thread? what software are you using to talk through the chipscope objects?
It's a relic of my experience in using Altera chips. I've used and abused Altera's In-System Sources and Probes feature for a long time in various designs for quick debugging. It's very convenient, because it goes over JTAG, which must be connected to program the chip anyway. Much nicer than having to run yet another cable around my already tangled desk.

My Altera implementations, on the DE2-115 dev board for example, use it and the mining script is already written and working: mine.tcl

So yes, my code for the LX150 ended up using the Xilinx equivalent, ChipScope's Virtual I/O. I did my initial tests to verify that the design is working on live hardware by simply using ISE's ChipScope interface. Now that the design is verified I am writing the actual mining script in TCL, for which Xilinx provides a ChipScope Engine interface.

Most other people seem to prefer using RS232 for the communication. I'm inclined to agree after seeing the Tcl interface to ChipScope  Tongue But I don't have an RS232-USB adapter at home.

magik
Newbie
*
Offline Offline

Activity: 44
Merit: 0


View Profile
July 24, 2011, 05:39:24 PM
 #432

Great reply, thanks. I have it successfully generating a golden nonce in simulation now - awesome.

I'm going to toy around and see if I can get this running faster than 100 MHz - or rather, routing at faster than 100 MHz. I'm liking that ISE 13 has multi-core support for stuff like routing and simulation now!
magik
Newbie
*
Offline Offline

Activity: 44
Merit: 0


View Profile
July 25, 2011, 02:31:42 AM
 #433

Tried popping in two more SHA cores to get 2 engines running (fully unrolled); ISE spat out this:
Quote
Slice Logic Utilization:
 Number of Slice Registers:           92543  out of  184304    50%  
 Number of Slice LUTs:                121337  out of  92152   131% (*)
    Number used as Logic:             113389  out of  92152   123% (*)
    Number used as Memory:             7948  out of  21680    36%  
       Number used as SRL:             7948

So it looks like, without a little bit of massaging, the current design uses up a bit more resources than the chip has...

I'm gonna try 2 hashing engines (4 cores) running at LOOP_LOG2=2 - that should be able to fit - and then I'll see how fast I can scale up the clocking and still have it routable
fpgaminer (OP)
Hero Member
*****
Offline Offline

Activity: 560
Merit: 517



View Profile WWW
July 25, 2011, 04:39:37 AM
 #434

I've just updated the public repo with the Tcl mining script I wrote for my Spartan-6 Dev Kit. I'm not really happy with it, but it does work on my machine. There's a hardcoded filepath in mine.bat that I couldn't get rid of, and some hardcoded JTAG addresses in mine.tcl that are specific to the dev kit. I had to use dumb string parsing instead of actual JSON, because Xilinx is using Tcl 8.4 instead of 8.5 for ChipScope stuff; and I had to drop TclCurl in favor of Tcl's http package. On the bright side, the http package works great and is more portable than TclCurl so I might do the same replacement on the Altera mining script. That should allow it to run on Linux  Cheesy

This verifies that the current LX150 Xilinx design works with a live pool  Cool

teknohog, I sent you a donation for laying the groundwork in the Xilinx Verilog project. makomk I also sent you a donation for the hard work you've done achieving 110MHz on Altera, and getting a fully unrolled core working on the LX150 chip. Many thanks to the both of you, and everyone who contributes to this project!

Quote
tried popping in two more sha cores to get 2 engines running ( fully unrolled ), ISE spit out this:
That's an awful lot of LUT usage. A single unrolled engine uses under 50% of the LUTs. Are those stats post Synthesis, or post Routing?

Going by the resource usage of a single engine, my guess is that two should fit on an LX150. However, it has been mentioned before that the Spartan-6 devices don't have fast carry chain routing on half of the slices. That may impede the ability to get two engines on an LX150. It might make more sense to use a single engine with extra pipelining and see if we can get it clocked up towards 200MHz. I will certainly be exploring both options.

Thank you for reporting those numbers, though. Let me know how your latest experiments go Smiley

magik
Newbie
*
Offline Offline

Activity: 44
Merit: 0


View Profile
July 26, 2011, 07:38:56 PM
Last edit: July 26, 2011, 09:03:12 PM by magik
 #435

hrm.... yeah, been doing more testing... and it seems like I have high LUT usage because some of the "RAM" is being inferred as LUTs?

do you get any of these messages when you compile?
Quote
INFO:Xst:3218 - HDL ADVISOR - The RAM <Mram_HASHERS[0].K> will be implemented on LUTs either because you have described an asynchronous read or because of currently unsupported block RAM features. If you have described an asynchronous read, making it synchronous would allow you to take advantage of available block RAM resources, for optimized device usage and improved timings. Please refer to your documentation for coding guidelines.
    -----------------------------------------------------------------------
    | ram_type           | Distributed                         |          |
    -----------------------------------------------------------------------
    | Port A                                                              |
    |     aspect ratio   | 64-word x 32-bit                    |          |
    |     weA            | connected to signal <GND>           | high     |
    |     addrA          | connected to signal <n1055>         |          |
    |     diA            | connected to signal <GND>           |          |
    |     doA            | connected to signal <HASHERS[0].K>  |          |
    -----------------------------------------------------------------------

Really odd... it's not happening to all of the sha256_transform modules though... it only seems to be one - the 2nd one, with NUM_ROUNDS set to 61, it appears.


also, I see things like this when it's synthesizing:
Quote
   Found 6x6-bit multiplier for signal <n1055> created at line 120.
    Found 6x32-bit multiplier for signal <n1057> created at line 127.
line 120 is:
Quote
assign K = Ks_mem[(NUM_ROUNDS/LOOP)*cnt+i];
line 127 is:
Quote
assign K_next = Ks_mem[(NUM_ROUNDS/LOOP)*((cnt+i) & (LOOP-1)) +i+1];
hrm... there has to be a better way to use those generate blocks so that these values are computed not as signals/wires at runtime, but rather as constant integers or lookup tables/muxes...

edit: update
if I use this for the K and K_next assignment when LOOP == 1, I don't get the LUT messages anymore:
Quote
`ifdef USE_RAM_FOR_KS
         if ( LOOP == 1) begin
            assign K = Ks_mem[ i ];
            assign K_next = Ks_mem[ i + 1 ];
         end else begin
...
I think the problem is that K and K_next are not assigned in a clocked block, thus they become asynchronous combinational logic - and XST can't map that to a block ROM?  Or maybe it's the use of a multiplier output as an address selector?  Something in there XST wasn't liking for me.
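For reference, here's a generic toy example (not the miner's code) of the two read styles and what XST does with them:

Code:
module rom_read_styles (
   input clk,
   input [5:0] addr,
   output [31:0] dout_async,
   output reg [31:0] dout_sync
);
   reg [31:0] rom [0:63];
   initial $readmemh("ks.hex", rom); // hypothetical init file

   // asynchronous read: XST infers distributed (LUT) RAM -> the Xst:3218 note
   assign dout_async = rom[addr];

   // synchronous read: the registered output makes it block-RAM eligible
   always @ (posedge clk)
      dout_sync <= rom[addr];
endmodule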

also, it seems the 1st round synthesizes much differently?
for the first sha block I get this:
Quote
   Summary:
   inferred  10 Adder/Subtractor(s).
   inferred 551 D-type flip-flop(s).
   inferred  17 Multiplexer(s).
Unit <sha256_transform_1> synthesized.

for the 2nd block I get this:
Quote
   Summary:
   inferred  62 RAM(s).
   inferred   2 Multiplier(s).
   inferred  63 Adder/Subtractor(s).
   inferred 295 D-type flip-flop(s).
   inferred  17 Multiplexer(s).
Unit <sha256_transform_2> synthesized.

why are these so different!?

First off, are they sharing the RAM for the K's? It seems only the K's for the 2nd block are generated, but Xilinx might be optimizing across the hierarchy here. But what about the # of adders/subtractors!? Only 10 in the first block? How can that be? Or is it shifting the position of the adders from the digester up to the higher module?


I also see this:
Quote
Synthesizing Unit <shifter_32b_9>.
    Related source file is "e:/bitcoin/lx150_makomk_test/hdl/sha256_transform.v".
        LENGTH = 8
WARNING:Xst:3035 - Index value(s) does not match array range for signal <m>, simulation mismatch.
which relates to this part of the shift register code:
Quote
      reg [31:0] m[0:(LENGTH-2)];
      always @ (posedge clk)
      begin
         addr <= (addr + 1) % (LENGTH - 1);

now when I look at that, I'm not sure if that's correct, so let's say LENGTH = 8.  The first line says create a 32-bit register array with (8-2+1) elements, so 7 elements, but the addr modulus wraps around at 7 - e.g. once ( addr + 1 ) == 7, then addr becomes 0, not 7.  So we are missing the last element of the shift register.

I think this is just an indexing problem - LENGTH = 8 means 8 elements in the shift register, so you want reg [31:0] m[0:7] or reg [31:0] m[0:(LENGTH-1)].  Then below, on the addr assignment, you would want addr <= ( addr + 1 ) % ( LENGTH ), because using a LENGTH of 8, xxx % 8 will always return a value inclusively between 0 and 7.

Not sure how this is even working with one of the shift registers effectively 1 element short...
edit: seems if I "fix" this, it breaks it, heh... I need to look into this
OK, another edit update: it seems this code is correct because you also have a 32-bit register r in there that's separate from the m storage register. And that also explains the different synthesis for this module. It's using a RAM, a 32-bit register r, a 3-bit register addr, and a 9-bit adder for the next address range, as opposed to just LENGTH*32 registers/FFs for the other types of shift registers... not sure which one is better here
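So in effect the structure is something like this - (LENGTH-1) RAM words plus the output register r, for a total delay of LENGTH cycles (my own reconstruction for illustration, not the repo's exact code):

Code:
module shifter_32b #(parameter LENGTH = 8) (
   input clk,
   input [31:0] val_in,
   output reg [31:0] r        // the separate output register
);
   reg [31:0] m [0:LENGTH-2]; // LENGTH-1 storage elements
   reg [2:0] addr = 0;        // wide enough for LENGTH <= 8 here

   always @ (posedge clk)
   begin
      addr <= (addr + 1) % (LENGTH - 1);
      r <= m[addr];           // read the oldest value first...
      m[addr] <= val_in;      // ...then overwrite it with the newest
   end
endmodule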


on another note, I placed 2 cores ( 4 sha256 transforms ) into the design, it said I was using 140% LUTs, but it's still trying to route it right now?  It's been running for over 12 hours though....
makomk
Hero Member
*****
Offline Offline

Activity: 686
Merit: 564


View Profile
July 26, 2011, 08:27:27 PM
 #436

Quote
No kidding. It broke my brain for a while, until I realized it was just a delay chain, so you could add to cnt to get what cnt "looks like" at each stage in the chain.
Heh. There's a reason I'd been putting off making those changes to the partial unrolling originally; it was fairly obviously beneficial, but also rather fiddly.

Quote
Edit 3: Can't quite get it to hit 100 MHz with LOOP_LOG2=1 on the XC6SLX75 (actual period = 11.045ns). Looks like I'd need to port over the patch for the extra pipeline stage to compute the initial value of H+K+W[0].
On my last run, I think ISE reported an actual period of 15ns. I'm still getting used to the timing report in ISE, so I could be wrong. Regardless, that's with it targeting 50MHz so I'm sure it will give better results with tighter constraints. I will certainly try to patch it for the initial t1_partial; that's bound to be helpful.
I haven't been able to reproduce the 11ns synthesis run for the XC6SLX75 since fixing the values of K_next, and I can't entirely figure out why; best I've seen since then is 14-15ns. (That's with 100 MHz as the target.) You might have better luck with a fully unrolled design, but then again perhaps not.

Quote
makomk I also sent you a donation for the hard work you've done achieving 110MHz on Altera, and getting a fully unrolled core working on the LX150 chip. Many thanks to the both of you, and everyone who contributes to this project!
Thank you! Though I'm not sure to what extent I helped with that second one... I don't even have the tools to attempt such a thing.

Quote
However, it has been mentioned before that the Spartan-6 devices don't have fast carry chain routing on half of the slices. That may impede the ability to get two engines on an LX150.
Haven't done the math on that but it'd probably work out much the same as fitting one on the LX75: not enough carry chains for all the adders, won't fit without some trickery.

Quote
hrm... there has to be a better way to use those generate blocks so that these values are computed not as signals/wires at runtime, but rather as constant integers or lookup tables/muxes...
I have some changes to do this, but they're on a computer I don't have access to this second and I don't think I pushed them to any public repos. (For some reason I appeared to be seeing a negative effect on Cyclone IV clock speeds at LOOP_LOG2=0.)

You may well find the tables aren't actually being synthesized as LUT RAM in the end anyway.

Quad XC6SLX150 Board: 860 MHash/s or so.
SIGS ABOUT BUTTERFLY LABS ARE PAID ADS
TheSeven
Hero Member
*****
Offline

Activity: 504
Merit: 500


FPGA Mining LLC


View Profile WWW
July 26, 2011, 09:43:55 PM
 #437

Quote
hrm.... yeah, been doing more testing... and it seems like I have high LUT usage because some of the "RAM" is being inferred as LUTs?

do you get any of these messages when you compile?
Quote
INFO:Xst:3218 - HDL ADVISOR - The RAM <Mram_HASHERS[0].K> will be implemented on LUTs either because you have described an asynchronous read or because of currently unsupported block RAM features. If you have described an asynchronous read, making it synchronous would allow you to take advantage of available block RAM resources, for optimized device usage and improved timings. Please refer to your documentation for coding guidelines.
    -----------------------------------------------------------------------
    | ram_type           | Distributed                         |          |
    -----------------------------------------------------------------------
    | Port A                                                              |
    |     aspect ratio   | 64-word x 32-bit                    |          |
    |     weA            | connected to signal <GND>           | high     |
    |     addrA          | connected to signal <n1055>         |          |
    |     diA            | connected to signal <GND>           |          |
    |     doA            | connected to signal <HASHERS[0].K>  |          |
    -----------------------------------------------------------------------

Really odd... it's not happening to all of the sha256_transform modules though... it only seems to be one - the 2nd one, with NUM_ROUNDS set to 61, it appears.


also, I see things like this when it's synthesizing:
Quote
   Found 6x6-bit multiplier for signal <n1055> created at line 120.
    Found 6x32-bit multiplier for signal <n1057> created at line 127.
line 120 is:
Quote
assign K = Ks_mem[(NUM_ROUNDS/LOOP)*cnt+i];
line 127 is:
Quote
assign K_next = Ks_mem[(NUM_ROUNDS/LOOP)*((cnt+i) & (LOOP-1)) +i+1];
hrm... there has to be a better way to use those generate blocks so that these values are computed not as signals/wires at runtime, but rather as constant integers or lookup tables/muxes...

If you go for fully-unrolled, which you probably should on an LX150, the K memory can (and usually will, at least with my code) be completely eliminated.
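Something along these lines - a toy sketch with placeholder round logic and only the first 8 of the 64 round constants shown, just to illustrate baking K in as a per-instance parameter:

Code:
module stage #(parameter [31:0] K = 0)
   (input clk, input [31:0] in, output reg [31:0] out);
   always @ (posedge clk) out <= in + K; // stand-in for the real round logic
endmodule

module unrolled_k_demo (input clk, input [31:0] d, output [31:0] q);
   localparam [255:0] KS = {           // K[7] down to K[0]
      32'hab1c5ed5, 32'h923f82a4, 32'h59f111f1, 32'h3956c25b,
      32'he9b5dba5, 32'hb5c0fbcf, 32'h71374491, 32'h428a2f98};
   wire [31:0] chain [0:8];
   assign chain[0] = d;
   genvar i;
   generate
      for (i = 0; i < 8; i = i + 1) begin : HASHERS
         // round constant resolved at elaboration time - no K RAM/ROM at all
         stage #(.K(KS[32*i +: 32])) U
            (.clk(clk), .in(chain[i]), .out(chain[i+1]));
      end
   endgenerate
   assign q = chain[8];
endmodule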

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY
magik
Newbie
*
Offline Offline

Activity: 44
Merit: 0


View Profile
July 26, 2011, 11:12:56 PM
Last edit: July 26, 2011, 11:31:19 PM by magik
 #438

good point, I should just be mapping the K's directly into the hashers as constants

I was able to achieve 100MHz and route it with the current design, slightly modified.  Changed the PLL to output clk0 so no clock division, and I changed the K/K_next assignments as to what I described in my previous post.  This routes it to a min clock period of 9.742ns for me.

It also seems like the worst-case critical path related to the 100MHz clock is between Hasher[8] and Hasher[13]: the output of an adder in Hasher[8], adding rx_state and k_next, gets registered into Hasher[13]'s shift_w1 register.

I also don't like this ChipScope in here, heh - it's treating the signals you are looking at as clocks and thus routes them on BUFGs through BUFGMUXes. It's low fanout so it doesn't matter that much, but from what I'm used to with FPGA design, you don't really want to be using non-clock signals as clocks, i.e. using these signals in edge logic, a la always @(posedge some_sig). I should probably see how many resources these ChipScope modules are taking up as well...

it'd be nice if there were a higher-level timing diagram/pipeline diagram for this process. I'd love to know exactly what each unit should be doing at any one time, e.g. which "nonce"/hash block X is currently working on.

it's taken me a bit of time to figure out what's going where, and the plethora of 1-letter signal/wire names really, really makes it hard to read and figure out what signals are what...


and hrm... the 2-engine design still tells me it's 140% LUTs, lol... and this is without getting the RAM => LUT message. BRAM/FIFO usage is 202/268... hrm... I wonder if this compile will also last 12+ hours
fpgaminer (OP)
Hero Member
*****
Offline Offline

Activity: 560
Merit: 517



View Profile WWW
July 27, 2011, 01:02:30 AM
 #439

Quote
I was able to achieve 100MHz and route it with the current design, slightly modified.  Changed the PLL to output clk0 so no clock division, and I changed the K/K_next assignments as to what I described in my previous post.  This routes it to a min clock period of 9.742ns for me.
Wonderful! I was able to route it at 90MHz successfully. Haven't tried 100MHz yet, because I figured it would require the extra Round0 partial t1 pipeline. I guess that isn't the case.

Quote
it'd be nice if there was a higher level timing diagram/pipeline diagram for this process.  I'd love to know what exactly each unit should be doing at any one time, e.g. which "nonce"/hash is block X currently working on.
At LOOP_LOG2=0 it is fairly simple. HASHERS[0] is working on nonce, HASHERS[1] is working on nonce - 1, HASHERS[2] is working on nonce - 2, ... HASHERS[63] is working on nonce - 63.

For LOOP_LOG2>0 it is a bit more confusing. The hashers are still in a chain, but there are fewer of them (by some power of two), and the last one in the chain feeds the first one. That allows work to loop around the hasher chain a number of times to complete all 64 needed rounds. As an example, a unit of work will enter at HASHERS[0], get worked on until HASHERS[31], and then get fed back into HASHERS[0] to be worked on another 32 times (for a total of 64 rounds). This feedback occurs on alternating clock cycles as determined by the cnt signal. When cnt==0, HASHERS[0] processes new work; on all other cnt values it processes old work from the last hasher in the chain.

Looking at K_next and cnt can give you an idea of what the HASHERS are doing at any given time, although it's still a bit obfuscated because K_next is used to calculate the t1_partial for the next hasher Tongue
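In rough Verilog terms the topology is something like this (stand-in round logic and assumed names, not the actual sha256_transform code - it's just here to show the feedback mux):

Code:
module hasher_loop_demo #(parameter LOOP_LOG2 = 1) (
   input clk, input cnt,               // 1-bit cnt is enough for LOOP_LOG2=1
   input [31:0] fresh_state,
   output [31:0] result
);
   localparam NUM_HASHERS = 64 >> LOOP_LOG2;
   wire [31:0] outs [0:NUM_HASHERS-1];
   genvar i;
   generate
      for (i = 0; i < NUM_HASHERS; i = i + 1) begin : HASHERS
         reg [31:0] state;
         always @ (posedge clk)
            if (i == 0)
               // head of the chain: fresh work when cnt==0, otherwise the
               // looped-back output of the tail
               state <= ((cnt == 0) ? fresh_state : outs[NUM_HASHERS-1]) + 1;
            else
               state <= outs[i-1] + 1;  // "+ 1" stands in for one real round
         assign outs[i] = state;
      end
   endgenerate
   assign result = outs[NUM_HASHERS-1];
endmodule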

Quote
the plethora of 1-letter signal/wire names really, really makes it hard to read and figure out what signals are what...
Such as? Readability is actually important to me, so I appreciate the feedback. A lot of the signals have short names, like K and t1, but that is sourced directly from SHA-256 terminology. The code in sha256_transform is probably the biggest area for confusion, but it does have to handle a lot, so I'm not surprised. Let me know any ways you think the names and design can be improved. I might even put together a chart like you suggested that shows graphically how the hashing chain works.

Quote
I haven't been able to reproduce the 11ns synthesis run for the XC6SLX75 since fixing the values of K_next, and I can't entirely figure out why; best I've seen since then is 14-15ns. (That's with 100 MHz as the target.) You might have better luck with a fully unrolled design, but then again perhaps not.
If you over-constrain, you can make routing go haywire. That's why when I ramp up timing constraints I do it in increments. It is a huge pain in the butt to have to re-compile a number of times, but it gives a better gauge of what the design can actually handle. Usually I under-constrain by a lot, see what the minimum period is, and then re-constrain to that period to push it further. Looks like magik has gotten 100MHz on an unrolled design. I've gotten 90MHz as well.

Quote
The first line says create a 32-bit register array with (8-2+1) elements, so 7 elements, but the addr modulus wraps around at 7 - e.g. once ( addr + 1 ) == 7, then addr becomes 0, not 7.  So we are missing the last element of the shift register.
With 7 elements, the range would be 0 to 6, so modulus 7 is correct. I had to double check it myself when I first saw it, but it is indeed correct. And it most certainly helps the design compared to what I was initially trying.

Quote
on another note, I placed 2 cores ( 4 sha256 transforms ) into the design, it said I was using 140% LUTs, but it's still trying to route it right now?  It's been running for over 12 hours though....
Well, I finally tried the same thing. Code is on the public repo now. It reports 139% LUT usage after synthesis, and gets stuck in Mapping. I let it run overnight with it still stuck on Mapping (Global Placement phase).

The design with a single core reports ~70% LUT after synthesis, and ~40% after routing, so routing does quite a bit to optimize the design. So even though synthesis reported 140%, I was hoping for routing to fit the design anyway. I'm guessing either the design is far too large, or we've run up against the "useless" slices issue.

I will probably put the multicore design to the side for now and work on a speed oriented design instead, using those extra slices to enhance pipelining.

magik
Newbie
*
Offline Offline

Activity: 44
Merit: 0


View Profile
July 27, 2011, 02:07:43 AM
Last edit: July 27, 2011, 04:29:17 AM by magik
 #440

OK, so I'm not crazy - yeah, I got to 140% LUTs pre-routing too, and ISE has been at the Map stage for something like 18 hours so far... gonna let it run; I've got a quad core, so doing parallel compiles is feasible. The first global placement run took 4 hours for me... and it's been stuck on the 2nd global placement run now - probably around 19 hours total running time so far... what's sad is this is just the Map phase... Place and Route has yet to run =(

The other thing I've been told is that your FPGA should never really exceed 60-70% usage pre-routing, because a lot of resources are needed to get high-speed routing done... and trying to pack 2 engines in there is likely nearing more like 90% usage...

in terms of pipelining, it's not so much the Hasher blocks I don't understand, it's the signals feeding into them... For example, what are the different-length shift registers for? What is the definition of cur_w#, and why do they need different lengths - or more specifically, what do the specific lengths correlate to? The previous hasher's output? And in a fully unrolled loop, why are shift registers even necessary? Shouldn't each hasher's digester essentially have the "register" of the state in there?
edit: ohh wait, nm - that's the message scheduler!

Maybe not a full block diagram outlining all the pipelined stages, but more of a "cell" diagram of a Hasher in terms of i. E.g. Hasher i has connected to its input Hasher i-1's cur_w0. Something like that might help me figure out exactly what's going on.

And I guess that ties into your question on signal names. Personally I would have some sort of prefix on every wire/reg. In VHDL there is no distinction, and the behavior ( wire or reg ) is inferred through the design - e.g. a signal assigned a value in a clocked process ==> register. And in VHDL I usually prefix all my signal names with sig_XXXXX. One of the problems I have with single-letter variable names is that they are impossible to search the document for: you have a variable K - want to see how hard it is to search a document for references to the letter K? If every K was instead sig_K, it would be much easier to search the document and find references. Basically, any single-letter variable name IMO is bad.

Some of the other signal names might be a mix of non-detailed naming + my inexperience with the SHA algorithm. For example, wtf does cur_w1 mean? I understand a _fb = feedback. But I don't know what w1 or w0 or w14 or w9 do. Also, I'm unsure what a _w means, or a _w1, or a _t1. Or a prefix of cur_ - not exactly sure what that means either.

And although it may be easy and quick to type, the shift register definition also has 2 single-letter registers, r and m - and this one isn't as bad because that stuff is internal, but imagine what a pain in the ass it is when you get a synthesis info/warning about some variable m - now I gotta search through all the source files by hand to look for a register m, because I can't just search for "m" in all the documents and get anything useful...

It might also help to organize the wire/reg definitions a little better. The way it is now, definitions are strewn throughout the code. I always prefer having my wire/reg/input/output declarations at the top of the module, like in software coding. It may also help to separate out the modules a little more. The sha256_transform is so complicated already - maybe move things like the digester or shift registers out into their own source files, that way the root sha256_transform module is more of a connectivity/hierarchy module defining the structure, not the function, of the sha256 transform.

But truthfully, my understanding of the sha256 algorithm and its pipelined version is probably a little bit lacking, and that is not helping me understand the code/flow.