Bitcoin Forum
May 20, 2025, 01:49:27 PM *
News: Latest Bitcoin Core release: 29.0 [Torrent]
 
   Home   Help Search Login Register More  
Pages: « 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 [23] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 »
  Print  
Author Topic: Official Open Source FPGA Bitcoin Miner (Last Update: April 14th, 2013)  (Read 432985 times)
fpgaminer (OP)
Hero Member
*****
Offline Offline

Activity: 560
Merit: 517



View Profile WWW
July 27, 2011, 07:34:45 AM
 #441

Quote
This would (for LOOP_LOG2=1) mean, that it will need to be fed a burst of work during 32 clocks, and then nothing at all during the next 32 clocks, which seems to be unneccessarily complicated to me.
It's fed "new" work every other clock cycle, confusingly enough. HASHERS[0], for example, will take new work at the first clock, old work at the second, new work at the third, old work at the fourth, and so on. It's really not a terrible way to do it, except for the long feedback chain required between the last and the first hasher, which is fine if those two instances are placed next to one another. ... and the fact that it's confusing and makes my brain squeal.

It may even be better, to be honest, because you have feedback in only one location instead of at every hasher. If you make a U shape out of the hasher placement it puts the first and last hasher next to each other for fast feedback there, and the rest can happily route forward. K_next is the only signal that changes the behavior of the hashers at each cycle, and is likely implemented as a 2:1 MUX at LOOP_LOG2=1.

Quote
Mine works in a different way. It feeds back a stage's outputs to its own inputs on every second clock, making HASHERS[0] handle rounds 0 and 1, and HASHERS[31] handle rounds 62 and 63. This way it just seems to work at half the frequency externally, no need for bursts.
That is how the normal code works. This is only applicable to makomk's latest revisions, where I think he chose to do a long chain loop instead of tight feedback because of how the W shift registers are implemented at LOOP_LOG2>0. He'll have to chime in here to verify, as I haven't worked all the logic out for myself.

Quote
Hm, this is weird. Maybe somehow related to verilog?
With my VHDL code, it tries to implement every single occurrence of two registers in a row as a shift register, leading to lots of unused flipflops and an unroutably high LUT usage. Limiting that to at least 4 registers in a row in the synthesis options improved things a lot.
Quite possibly. I am mostly just happy to finally have a routable design running on my LX150 at 100MHz  Tongue At least on Altera, Quartus couldn't figure out how to shove things into the M9Ks with vanilla code. makomk's version with explicit shifters is what got Quartus to use the M9Ks and cram the design into 75K LEs.

Perhaps the way VHDL is synthesized allows the compiler to better realize chains of registers? It seems odd, but I wouldn't put anything past these compilers Tongue

makomk
Hero Member
*****
Offline Offline

Activity: 686
Merit: 564


View Profile
July 27, 2011, 01:23:02 PM
Last edit: July 27, 2011, 01:50:52 PM by makomk
 #442

edit: update
if I use this for the K and K_next assignment when LOOP == 1, I don't get the LUT messages anymore:
Quote
`ifdef USE_RAM_FOR_KS
         if ( LOOP == 1) begin
            assign K = Ks_mem[ i ];
            assign K_next = Ks_mem[ i + 1 ];
         end else begin
...
I think the problem is that K and K_next are not assigned in a clock state, thus they become asynchronous combinatorial logic - and XST can't map that to a ROM?  Or maybe it's the addition of using a multiplier output as an address selector?  Something in there XST wasn't liking for me.
I'm guessing the reason you don't get the LUT RAM messages anymore is because xst is now interpreting that as a single constant value rather than as a table lookup. (Since i is a genvar, both i and i+1 are constant at synthesis time, and xst already knows Ks_mem is read-only.)

I was able to achieve 100MHz and route it with the current design, slightly modified.  Changed the PLL to output clk0 so no clock division, and I changed the K/K_next assignments as to what I described in my previous post.  This routes it to a min clock period of 9.742ns for me.
Congratulations, nice one! Wonder what exactly ISE was doing before...

It also seems like the worst case critical path related to the 100MHz clock is between Hasher[8] and Hasher[13], looks like an output of an adder in Hasher[8], adding rx_state and k_next gets registered into Hasher[13]'s shift_w1 register.
That's interesting.

Edit: Just noticed there was another page of discussion. fpgaminer's explanation is indeed correct, including for the bits I modified. (I'm afraid I'm actually responsible for a lot of the confusing bits, including the cur_ prefixes which are there to distinguish between the values of wn in this round and the variable new_w15 which holds the value of w15 used by the next round.)

It might be more useful to think of the w computation as being
Code:
w[i+16] = w[i] + s0(w[i+1]) + w[i+9] + s1(w[i+14])
because that's effectively how it's being calculated. cur_w<n> actually means w[<n>+i] and next_w15 will hold the value of w[i+16] one clock cycle later.

This is only applicable to makomk's latest revisions, where I think he chose to do a long chain loop instead of tight feedback because of how the W shift registers are implemented at LOOP_LOG2>0. He'll have to chime in here to verify, as I haven't worked all the logic out for myself.
That's one reason for the change, but it's probably not the only one; all those muxes at every stage in the pipeline aren't exactly cheap from what I can tell. (Also: don't forget the extra added clock cycle of delay in the feedback path in order to make each piece of data arrive at the input with cnt one higher than on its previous pass through the pipeline.)

Quad XC6SLX150 Board: 860 MHash/s or so.
SIGS ABOUT BUTTERFLY LABS ARE PAID ADS
magik
Newbie
*
Offline Offline

Activity: 44
Merit: 0


View Profile
July 28, 2011, 04:22:08 AM
 #443

just got off a long day of work, brain hurts, so I havn't really gone over what you've written here so far, but skimming it definitely seems like it will help me understanding everything.  When I get some free time I'll try to analyze it and work it all out, maybe even draw a single cell diagram for anyone else trying to wrap their heads around it.

On another note - the 2engine design I let running finally finished, and failed.  It ran out of placement sites - said it was short like 1k FFs and 3k LUTs to implement the design - and this was before routing, so it may not even have been able to attain 100 MHz there.  I am running it again, but it may take 2-3 days again - but this time I've enabled some more aggressive optimizations in the Xilinx ISE.

We might have to settle for having 1 fully unrolled engine, and then 1 LOOP_LOG2=1 engine running and that should hopefully get to ~150MHz pretty easily.

Has anyone else gotten a 100MHz version working?  Should I compile a bit file for someone to test it?  I hate flying blind with a target device to test on.
makomk
Hero Member
*****
Offline Offline

Activity: 686
Merit: 564


View Profile
July 28, 2011, 08:08:59 AM
 #444

On another note - the 2engine design I let running finally finished, and failed.  It ran out of placement sites - said it was short like 1k FFs and 3k LUTs to implement the design - and this was before routing, so it may not even have been able to attain 100 MHz there.  I am running it again, but it may take 2-3 days again - but this time I've enabled some more aggressive optimizations in the Xilinx ISE.
Ah. If you look at the list of LUTs it couldn't place, you'll almost certainly find that they're adders with carry chains: there aren't enough suitable sites with fast carry hardware to place all of the 32-bit-wide adders contiguously. I got the same error trying to fit a fully unrolled engine onto a XC6SLX75.

Quad XC6SLX150 Board: 860 MHash/s or so.
SIGS ABOUT BUTTERFLY LABS ARE PAID ADS
makomk
Hero Member
*****
Offline Offline

Activity: 686
Merit: 564


View Profile
July 29, 2011, 01:32:19 PM
Last edit: July 30, 2011, 12:25:55 AM by makomk
 #445

Finally got a set of modifications to the K_next lookup that ISE likes! It reported 100 MHz at LOOP_LOG2=1 on XC6SLX75 and I've sent a pull request to fpgaminer. Looks like it might be a bit hit and miss as to whether it meets timing though, since the next run failed with an actual period of 10.412ns rather than the required 10ns.

Edit: Might help if I actually linked to it. It's in the projects/LX150_makomk_Test dir of the xilinx-better-k-for-merge branch
Edit 2: Now it's consistently getting 10.412ns *sigh*.
Edit 3: Trivial fix for this - in addition to using that branch, find the line:
Code:
.rx_t1_part(feedback ? t1_part_fb : (rx_state[`IDX(7)] + cur_w0 + K))
Amend it to read:
Code:
.rx_t1_part(feedback ? t1_part_fb : (rx_state[`IDX(7)] + rx_input[31:0] + K))
You should now be able to achieve 100MHz (50MHash/sec) on the XC6SLX75, at least in theory. It's late and I need to sleep so this isn't committed anywhere yet.

Quad XC6SLX150 Board: 860 MHash/s or so.
SIGS ABOUT BUTTERFLY LABS ARE PAID ADS
fpgaminer (OP)
Hero Member
*****
Offline Offline

Activity: 560
Merit: 517



View Profile WWW
July 30, 2011, 12:42:51 PM
 #446

Quote
Trivial fix for this - in addition to using that branch, find the line:
Hmm ... that change doesn't seem like it should have any impact; ISE should be able to figure out that cur_w0 is rx_input[31:0] there unless I missed something. Was the timing analyzer pointing that out as being the trouble maker?

makomk
Hero Member
*****
Offline Offline

Activity: 686
Merit: 564


View Profile
July 30, 2011, 04:00:48 PM
 #447

Hmm ... that change doesn't seem like it should have any impact; ISE should be able to figure out that cur_w0 is rx_input[31:0] there unless I missed something.
It might for LOOP_LOG2=0, but it doesn't appear to for LOOP_LOG2=1. I've no idea why it doesn't; perhaps it doesn't know to make that particular optimization.

Was the timing analyzer pointing that out as being the trouble maker?
The timing analyzer was pointing to a long chain of logic, but it did indeed include an unnecessary mux between w0_fb and rx_input[31:0] feeding into the computation of the initial value of t1_part. Replacing that with a direct reference to rx_input[31:0] resulted in it reaching timing closure at 100MHz. (Again: this is all with LOOP_LOG2=1)

Quad XC6SLX150 Board: 860 MHash/s or so.
SIGS ABOUT BUTTERFLY LABS ARE PAID ADS
fpgaminer (OP)
Hero Member
*****
Offline Offline

Activity: 560
Merit: 517



View Profile WWW
August 02, 2011, 12:10:36 PM
 #448

August 8th, 2011 - Spartan-6 Code Added and Working, Altera Mining Script Updated
Thanks to the efforts of makomk, the Spartan-6 series of chips are now supported, and achieve the highest performance per $ of any chip. Code has been verified working on my LX150 development kit, and can achieve up to 100MH/s of performance.

The Altera Tcl Mining Script has just received a massive update. No more need to edit mine.tcl to hack in your FPGA's hardware and device names; mine.tcl will automatically detect mining FPGAs connected to the system. Pool information has been moved to a config.tcl file for easy editing. No more dependency on TclCurl, so the script should be Linux friendly now. And best of all, the console output has been cleaned up to look like poclbm.

Here is a full list of updates:

  • No more dependency on TclCurl, so less hassle on Linux.
  • Auto-detection of mining FPGAs. No more need to edit mine.tcl to put in your FPGA's information. Just run and the script should find the FPGA automatically.
  • Pool url, worker username, and worker password split into config.tcl. See config.example.tcl.
  • All debug messages removed from console output.
  • Console output is timestamped.
  • Miner information reported to console similar to poclbm's console output.
  • Console reports FPGA actual hashing rate, estimated hashing rate, and accepted/rejected shares.
  • Better error handling.
  • Tracking accepted/rejected shares.
  • Generic API introduced for communication with the FPGA, allowing better future support for different chips/firmwares.
  • General code maintainability improvements.


Screenshot of the new console output:


Screenshot of the script automatically detecting my FPGA  Cool



There are still two missing major features. 1) It doesn't find and utilizing multiple connected FPGAs. 2) Long Polling
The firmware code also needs to be updated to support a work queue and results queue. All of these are planned future updates. I will also (hopefully) merge the Xilinx and Altera mining scripts together, for ease of use and mixing both kinds of chips on the same system Tongue

TheSeven
Hero Member
*****
Offline Offline

Activity: 504
Merit: 500


FPGA Mining LLC


View Profile WWW
August 03, 2011, 08:43:35 PM
 #449

Do you want to get rid of the virtual wire / chipscope? If yes, we should probably decide on a common simple serial protocol for all these FPGA designs, so that one software will fit them all. I'll happily implement that in my python miner, so you don't need to care about the software side of things.

My tip jar: 13kwqR7B4WcSAJCYJH1eXQcxG5vVUwKAqY
pusle
Member
**
Offline Offline

Activity: 89
Merit: 10


View Profile
August 04, 2011, 04:32:15 AM
 #450


I second that.  Uarts are easily implemented in FPGA's and the obvious choice for a "universal" interface. (via usb mostly)

May I suggest that the protocol handles multiple devices connected in parallel by some form of adressing.
This way lots of fpga's can be connected together easily on a common board and served across a single com port. (similar to RS-485)

I guess this thread is as good place as any to discuss the protocol?



fpgaminer (OP)
Hero Member
*****
Offline Offline

Activity: 560
Merit: 517



View Profile WWW
August 04, 2011, 08:36:59 AM
 #451

Quote
If yes, we should probably decide on a common simple serial protocol for all these FPGA designs, so that one software will fit them all. I'll happily implement that in my python miner, so you don't need to care about the software side of things.
I completely agree, and Python is a far better choice for controller software than Tcl Tongue

I took the time to write out some preliminary specifications for the internal hardware interfaces:
https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/wiki/Specification:-Components,-Interfaces,-and-Protocols-%5BWIP%5D

TL;DR: I abstracted away all of the hardware and implementation specifics and shielded them with a single, memory mapped component called the Comm. Being memory mapped, it's a trivial interface that protocols like I2C and SPI can wrap easily.

Relevant to the topic at hand, the specifics of this wrapping by SPI, I2C, UART, JTAG, or what have you will be documented here:
https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/wiki/Specification:-PHY-Interfaces

And yes, I've abused the term PHY in that document and the previous specification. Because I'm a horrible person and I want to watch language burn for my own personal pleasure Tongue

Currently in that PHY specification I've only outlined the requirements. TL;DR: It must support multiple devices (addressing a single device specifically), allow reading registers, and writing registers. That's it.

I haven't laid down any ground work for any specific protocols yet, but I listed out what the most immediate ones probably are: UART, SPI, and JTAG. Feel free to begin specifying those in this thread.

And as always, feel free to critique my work, call me stupid, and redo the whole thing Tongue

fpgaminer (OP)
Hero Member
*****
Offline Offline

Activity: 560
Merit: 517



View Profile WWW
August 04, 2011, 10:13:50 AM
 #452

Regarding Performance Optimizing Spartan-6 LX150:
DANGER: Long detailed post coming... sorry, I hope the information is useful though

I'm working with my LX150_makomk_speed_Test project, where I'm trying to nail down the performance bottlenecks and remove them. I'm learning up FPGA Editor so I can better visualize what the router is doing, and I've read through some of the Spartan-6 UGs to get a better understanding of the architecture.

First off, I will say I am quite impressed by Xilinx's work and foresight on the logic of the S6's slices. They can perform a 3 component, 32-bit addition in 8 chained slices, with 4-bits being computed per slice. That blew my mind when I saw it in the FPGA Editor. This is great for our mining algorithm, and you can see why in this critical path analysis:

Code:
W:           16 slices + 0 slices = 16
tx_t1_part:  8  slices + 0 slices = 8

t1:          8  slices + 0 slices
tx_state[0]: 8  slices + 8 slices = 16
tx_state[4]: 8  slices + 8 slices = 16
The worst critical paths are only 16 slices long, with a single break in the carry chain (AFAIK). W is a 4-way, performing a 3-way of the first 3, and a 2-way of the result and the remaining component. tx_state[4] is a 2-way with t1 and rx_state[3].

I haven't fully analyzed the router's behavior on the 2-way's yet, but it appears to include work from other operations ... somehow. Not sure yet.

So, that's the good news. The bad news is, of course, only half of the slices are useful. There are two slices in a CLB. One slice always has fast-carry logic and chains to the slice directly above it (in the CLB above it). The other slice is a lowest form of life slice. It's still a powerful slice, with 4 6-LUTs (or 8 5-LUTs, or combinations thereof), and 8 flip flops, but the mining algorithm has rare use for it.

The next bad news is, only half of the "good" slices can be used as RAM or shift registers. That's not a terrible thing since most will be consumed as adders anyway.

And that's about all I could find that's particularly good or bad with the S6 slices. Since the good slices are all in columns, and spaced evenly, the impact of the useless slices should actually be far less severe than I thought.


For the S6's routing architecture, the quick overview basically said routing costs roughly Manhattan Distance between CLBs. I haven't dug into the details more than that at this point.



With that knowledge in hand, and some beginner's experience with FPGA Editor, I dived in and found what appears to be the largest bottleneck in the current code:

Code:
Slack (setup path):     0.264ns (requirement - (data path - clock path skew + uncertainty))
  Source:               uut2/HASHERS[41].shift_w1/Mram_m (RAM)
  Destination:          uut2/HASHERS[41].upd_w/tx_w15_30 (FF)
  Requirement:          12.500ns
  Data Path Delay:      11.716ns (Levels of Logic = 6)
  Clock Path Skew:      -0.260ns (0.780 - 1.040)
  Source Clock:         hash_clk rising at 0.000ns
  Destination Clock:    hash_clk rising at 12.500ns
  Clock Uncertainty:    0.260ns

  Clock Uncertainty:          0.260ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter (TSJ):  0.070ns
    Total Input Jitter (TIJ):   0.000ns
    Discrete Jitter (DJ):       0.450ns
    Phase Error (PE):           0.000ns

  Maximum Data Path at Slow Process Corner: uut2/HASHERS[41].shift_w1/Mram_m to uut2/HASHERS[41].upd_w/tx_w15_30
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    RAMB16_X2Y46.DOA2    Trcko_DOA             1.850   uut2/HASHERS[41].shift_w1/Mram_m
                                                       uut2/HASHERS[41].shift_w1/Mram_m
    SLICE_X60Y126.A2     net (fanout=4)        5.845   uut2/HASHERS[41].cur_w1<2>
    SLICE_X60Y126.COUT   Topcya                0.379   uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_cy<19>
                                                       uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_lut<16>
                                                       uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_cy<19>
    SLICE_X60Y127.CIN    net (fanout=1)        0.003   uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_cy<19>
    SLICE_X60Y127.BMUX   Tcinb                 0.292   uut2/HASHERS[41].shift_w0/r<27>
                                                       uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_cy<23>
    SLICE_X78Y122.B3     net (fanout=1)        1.995   uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd_211
    SLICE_X78Y122.BMUX   Tilo                  0.251   uut2/HASHERS[41].upd_w/tx_w15<23>
                                                       uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd221
    SLICE_X78Y122.C5     net (fanout=2)        0.383   uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd221
    SLICE_X78Y122.COUT   Topcyc                0.295   uut2/HASHERS[41].upd_w/tx_w15<23>
                                                       uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_lut<0>22
                                                       uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_cy<0>_22
    SLICE_X78Y123.CIN    net (fanout=1)        0.003   uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_cy<0>23
    SLICE_X78Y123.COUT   Tbyp                  0.076   uut2/HASHERS[41].upd_w/tx_w15<27>
                                                       uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_cy<0>_26
    SLICE_X78Y124.CIN    net (fanout=1)        0.003   uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_cy<0>27
    SLICE_X78Y124.CLK    Tcinck                0.341   uut2/HASHERS[41].upd_w/tx_w15<31>
                                                       uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_xor<0>_30
                                                       uut2/HASHERS[41].upd_w/tx_w15_30
    -------------------------------------------------  ---------------------------
    Total                                     11.716ns (3.484ns logic, 8.232ns route)
                                                       (29.7% logic, 70.3% route)

It's being forced to route from a RAMB16BWER to a CLB that's right smack dab in the middle of a group of columns, furthest possible position from possible RAM locations. Here, check this image out, it will make you go insane, so don't stare too long:
https://i.imgur.com/gBv5R.png (RAM is on the left).
No, seriously, don't stare at it. The router will drive insanity into the depths of your soon rotting brain fleshes. After exploding the Universe, of course.

Oh, you looked at it anyway and are wondering about that little path heading downward? Yeah, it keeps going ... and going ... (into my damned soul).


And as you can read from the timing report above, routing accounts for *drum roll* 70.3%! Yay! That's 8ns of routing, and only 3.4ns of logic! Imagine if we got rid of all the routing...


I see four solutions at the moment, and will investigate as time allows:

1) Get rid of the RAMB16BWER to some extent.
2) Add an extra register to the output of shifter_32b when inferring RAM logic. Flip-flops should route close to the logic and mask RAM routing delay.
3) Add two, duplicate registers to the output of shifter_32b when inferring RAM logic.
4) Ditch the RAM infer completely and try to coax ISE into using all those flip-flops in the useless slices (which are peppered throughout the routed design at the moment).

I will try 3 first, and hope ISE does the intelligent thing. My hope is that flip-flops in the useless slices will get utilized, since they're mingled in with the useful logic and so should provide somewhat fast local routing.

The interesting this is that we've got lots of RAM to play with. The design is using ~30% of the 16BWERs, and none of the 8BWERs. It seems like a good idea to try to use them and bring slice consumption down if possible, but only if their awkward placement can be solved appropriately.

makomk
Hero Member
*****
Offline Offline

Activity: 686
Merit: 564


View Profile
August 04, 2011, 11:31:27 AM
 #453

First off, I will say I am quite impressed by Xilinx's work and foresight on the logic of the S6's slices. They can perform a 3 component, 32-bit addition in 8 chained slices, with 4-bits being computed per slice. That blew my mind when I saw it in the FPGA Editor. This is great for our mining algorithm, and you can see why in this critical path analysis:
Yeah, neat isn't it? That's a big part of the reason I was initially confident about Spartan 6 performance, at least until I found out about the catch...

So, that's the good news. The bad news is, of course, only half of the slices are useful. There are two slices in a CLB. One slice always has fast-carry logic and chains to the slice directly above it (in the CLB above it). The other slice is a lowest form of life slice. It's still a powerful slice, with 4 6-LUTs (or 8 5-LUTs, or combinations thereof), and 8 flip flops, but the mining algorithm has rare use for it.
Which is a tad irritating.

It's being forced to route from a RAMB16BWER to a CLB that's right smack dab in the middle of a group of columns, furthest possible position from possible RAM locations. Here, check this image out, it will make you go insane, so don't stare too long:
https://i.imgur.com/gBv5R.png (RAM is on the left).
No, seriously, don't stare at it. The router will drive insanity into the depths of your soon rotting brain fleshes. After exploding the Universe, of course.
Whoops, you're right. It appears to be doing the something similar on the SLX75 too; I should've spotted that sooner. (Judging from the timing report at least; I haven't looked in FPGA Editor because the Linux version is a pain in the ass to actually run on my distro. It uses some grotty ancient Motif-based wrapper library that emulates some prehistoric version of the Windows API and requires deep magic to get it to launch.)

The interesting this is that we've got lots of RAM to play with. The design is using ~30% of the 16BWERs, and none of the 8BWERs. It seems like a good idea to try to use them and bring slice consumption down if possible, but only if their awkward placement can be solved appropriately.

The 8BWERs suffer from some slightly annoying errata. Also I think they may actually be shared with the 16BWERs and so are not actually all available, but don't quote me on that. Not sure to what use they can be put either.

Quad XC6SLX150 Board: 860 MHash/s or so.
SIGS ABOUT BUTTERFLY LABS ARE PAID ADS
fpgaminer (OP)
Hero Member
*****
Offline Offline

Activity: 560
Merit: 517



View Profile WWW
August 04, 2011, 12:07:12 PM
 #454

Quote
Not sure to what use they can be put either.
I'm guessing you could pair them up to get the equivalent of a 16BWER, but I'm not sure.

I just tried adding an extra register to the shifter's inferred RAM. After compilation it failed timing (80MHz) and ... the register was gone. I'm guessing it optimized the register away somehow, or balanced it. Either way, it ended up having a negative impact. I'm running another compile with USE_XILINX_BRAM_FOR_W off to see how that works. Perhaps we need to find a way for ISE not to optimize the shifter so much when USE_XILINX_BRAM_FOR_W is being used?

makomk
Hero Member
*****
Offline Offline

Activity: 686
Merit: 564


View Profile
August 04, 2011, 12:21:33 PM
 #455

I just tried adding an extra register to the shifter's inferred RAM. After compilation it failed timing (80MHz) and ... the register was gone. I'm guessing it optimized the register away somehow, or balanced it. Either way, it ended up having a negative impact. I'm running another compile with USE_XILINX_BRAM_FOR_W off to see how that works. Perhaps we need to find a way for ISE not to optimize the shifter so much when USE_XILINX_BRAM_FOR_W is being used?
If you haven't tweaked ISE's setting for the minimum size of shift register to infer yet, you probably need to do that. (By default two or more registers in a row are automatically replaced by a shift register.)

Quad XC6SLX150 Board: 860 MHash/s or so.
SIGS ABOUT BUTTERFLY LABS ARE PAID ADS
pusle
Member
**
Offline Offline

Activity: 89
Merit: 10


View Profile
August 04, 2011, 01:07:21 PM
 #456


About the interfaces/protocol:

The data would most likely end up going through a com port eventually.  Might it not be better to have an 8bit data width in the registers etc ?
Then you don't have to worry about dis/assembling the 32bit words either and the usual what are LSB and MSB, endian ARGH! Tongue

These huge 512/256 bit internal busses between slave, master and comm will eat quite a bit of routing resources.
I would suggest having an internal serial format using the system clock + 1 bit bus and a clock enable/data valid ctrl signal.
Sender has a tiny state machine/counter, while the receiver module is only a shift register with a clock enable.

To make up for the tiny tiny waste in waiting for new data your could have double buffers in the slaves.


makomk
Hero Member
*****
Offline Offline

Activity: 686
Merit: 564


View Profile
August 04, 2011, 08:43:29 PM
 #457

If you haven't tweaked ISE's setting for the minimum size of shift register to infer yet, you probably need to do that. (By default two or more registers in a row are automatically replaced by a shift register.)

Hmmm. Actually, that doesn't seem to quite solve the problem. Try something like this patch. Causes slightly worse timing elsewhere on XC6SLX75 so it's actually a net loss there, but you may be able to avoid or work around this. (Edit: Also, KEEP is probably not the right constraint to use here; it will cause unneeded registers and logic and RAM we want to elimiate to be kept.)

Quad XC6SLX150 Board: 860 MHash/s or so.
SIGS ABOUT BUTTERFLY LABS ARE PAID ADS
makomk
Hero Member
*****
Offline Offline

Activity: 686
Merit: 564


View Profile
August 10, 2011, 06:10:10 PM
 #458

Finally got around to coding some maximum clock speed improvements for users of smaller Cyclone III and IV devices - now available from my new partial-unroll-speed branch. Expected minimum device size and speed is roughly as follows:

LOOP_LOG2=0: minimum device size ~75k LEs, maximum theoretical speed 110 MHash/s @ 110 MHz. Use the DE115_makomk_mod directory in fpgaminer's main git repository.
LOOP_LOG2=1: minimum device size ~50k LEs, max theoretical speed 50 MHash/s @ 100 MHz. Use DE2_115_Unoptimized_Pipelined directory in the partial-unroll-speed branch of my github repository.
LOOP_LOG2=2: minimum device size ~30k LEs, max theoretical speed 25 MHash/s @ 100 MHz. Use same partial-unroll-speed code, but edit the line after "`ifdef USE_EXPLICIT_ALTSHIFT_FOR_W" in sha256_transform.v to read "if(LENGTH >= 5) begin" rather than 4.
LOOP_LOG2=3: minimum device size ~20k LEs, max theoretical speed 13.75 MHash/s @ 110 MHz. Use partial-unroll-speed again, no modifications required.
LOOP_LOG2>3: not supported by either of these two code code versions, stick to the original partially-unrolled code in the DE2_115_Unoptimized_Pipelined directory of fpgaminer's git repository? (Bear in mind it may not be very area-efficient.)

It may be possible to get some of these up to 110 MHz or fit them onto slightly smaller devices (actual LE usage is currently about 74k, 46k, 26k?, and 16k respectively).

Also: remember that some or all of these may require extra cooling in order to operate safely.

Quad XC6SLX150 Board: 860 MHash/s or so.
SIGS ABOUT BUTTERFLY LABS ARE PAID ADS
jonand
Newbie
*
Offline Offline

Activity: 12
Merit: 0


View Profile
August 12, 2011, 12:50:12 PM
 #459

I've been playing around with the xilinx-verilog port in the github repo and can confirm that it works just fine on the Xilinx Spartan-6 XC6LX9 microboard eval board from Avnet for $69.

Couldn't fit anything but unrolling = 5 so speed is a few Mhash/s only, but I really recommend anyone who want to have a look at FPGAs to get this little USB "dongle" sized board. It has built-in usb jtag-cable and usb-serial console so nothing extra is required to get it up and running teknohog's code.

Let me know if anyone wants the UCF file, I just took the pin numbers from the avnet UCF example and replaced the UCF from github.

It is also nice to replace the 7segment with a simple LED output to see that your serial comm is working.

I tried clocks up to >90MHz and it runs just fine.

As a courtesy to teknohog I used his account (in the miner.py) when mining so he got the reward for the few shares I found..

Oh, I actually used windows7 for this, I couldn't get the digilent xilinx cable drivers to work on my linux box. It was easier to install python on the windows laptop.


You can also look at it this way: Learn FPGA programming, get a job and make more money than you most likely ever will on mining bitcoins! Smiley

jonand
Newbie
*
Offline Offline

Activity: 12
Merit: 0


View Profile
August 12, 2011, 02:21:50 PM
 #460

The teknohog xilinx-verilog port also works good with the Xilinx Spartan 6 LX45T eval board called "SP605".

The only trick is that TX and RX in the SiLabs USB-RS232 adapter is reversed so you need to swap the pins in the UCF file to make a null modem.

After finding 7 blocks I have a hash rate of 13.1 +-4.95 with a clock speed of 63MHz and unrolling = 3.


I'm not saying these boards are useful for any real life mining but if you need a working reference example this does the job.

You can test run these devices with the xilinx ISE webpack license (which reports some stats back to xilinx but has a $0 cost). The only thing you risk is a salesguy calling you and asking when you will be purchasing volume..  Tongue

Good luck, have fun!

Pages: « 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 [23] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 »
  Print  
 
Jump to:  

Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!