I'm currently learning to program the MSP430. Anybody interested is invited to join me, as this is the biggest obstacle for our project to continue. I'd be happy to write firmware for the MSP430, and the supporting PC software. Heck, I even have three MSP430 kits; I couldn't resist their $4 + free shipping price. Correct me if I am mistaken: the MSP430 you have selected will be present on each DIMM board. It will be an MSP430 with built-in USB support. The MSP430 will be connected by SPI to pins on both of the FPGAs. If that is correct, I can: 1) Write the SPI module and test it on my Spartan-6 dev kit. 2) Use my MSP430 dev kit to write the MSP430 code and talk to my S6 dev kit. 3) Write a Python interface (console/UI) for the PC software to talk to the MSP430. Caveat: my MSP430 dev kit has the lowest-form-of-life MSP430 on it. USB is supported through extra chips, so I don't have an actual MSP430 with built-in USB support. So I, or someone else, will eventually have to get one of those chips and tweak the code to adjust for any differences. Please correct my understanding of the current design I quoted above, and tell me if the steps I listed seem correct. If all is well, I'll go off and get your firmware and software all ready.
|
|
|
So basically similar algo optimizations were implemented in your FPGA code? Not really. Those are GPU optimizations, and are focused on reducing the total number of calculations needed to compute the final hash. That benefits the GPU's performance, but none of them would really have a useful impact on an FPGA's MH/s performance. So I haven't implemented most of them, because they aren't too helpful, would clutter the code, and my time is better spent elsewhere for now.
|
|
|
I'm mostly just the firmware designer, but I will try to chime in on a few things here for everyone's benefit. First and foremost, I do not think it has been made clear that these boards are firmware upgradeable. As improvements are made to the Open Source FPGA Bitcoin Mining project, new firmware will be generated and made available, just like we see improvements to GPU software that let you drive more out of your dusty 5850s. Personally, I am very excited about this little board as a development platform. Beats the heck out of my $1000 dev kit. Connecting Multiple Boards: As newMeat1 has hinted at, this was being fleshed out and worked on even before these first boards were sent into production. I'll be playing around with FTDI chips and/or micros, and see if we can get a nice, simple solution that will be backwards compatible with existing boards, streamline future designs, and scale well for multiple boards. This is really really cool and I'm glad to see a working board for sale so quickly! Not needing PCIe is like a dream, and mining via USB on my old AMD Duron machine would rock! Best of luck! I have an ~8-year-old laptop that could probably drive these things. All current GPU mining rigs can also drive it, in addition to driving their GPUs. Also, is there any chance that the FPGA miner code will run on a Spartan XC2S30? The Spartan-2 series is likely too old to push many MH, if any. That particular chip only has ~1K CLBs, for example. Are you certain? Aren't the LUTs involved in routing too? I always thought that they were either configured as logic or as routing nodes. Think of an FPGA as a massive breadboard, with LUTs glued onto it in columns and rows. The breadboards are your "routing fabric" and allow you to choose how you connect the LUTs, and of course you can load the LUTs with whatever configuration you want. That's your most basic FPGA architecture. Just note that there is a "routing cost" proportional to the distance you try to route.
For example, connecting one LUT to another half way across your massive breadboard would lead to a massive signal delay. On optimization of mining algo (be it on GPUs, CPUs or FPGAs) is there a post that describes all available optimizations? This thread gives a good run down of all the optimizations being implemented on GPUs.
|
|
|
Well, I had used LX150_Test, and I don't have an LX150 dev board, so I think I will just share my ideas here... Oops, sorry, LX150_Test isn't really usable at the moment. I really need to add a useful README outlining all those different project variations... Thank you for contributing your idea! Please take a look at the project variation I linked: https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/LX150_makomk_Test You will find that your idea, for the most part, has already been implemented in there. Specifically, look around this line. BUT: you did point something out that I think I missed. In the code I linked you'll see that the pre-calculated T1 value is stored in a separate register, not tx_state[7] as you listed in your example. On looking at my code, I believe you are correct; tx_state[7] is never used (except for the last round), so it could be removed or replaced with the partial calculation. Good catch, Anoynomous! Not sure if the compiler catches this optimization automatically or not. Again, s0_w can be calculated a loop ahead and added to rx_w[31:0]; this way our new_w will be shortened. Now that, I hadn't thought of. Another fantastic catch, Anoynomous! Double-check me on this:

tx_pre_w <= s0(rx_w[2]) + rx_w[1]; // Calculate the next round's s0 + the next round's w[0].
tx_new_w <= s1(rx_w[14]) + rx_w[9] + rx_pre_w;
If the above solution is applied, the calculation of new_w will be the new critical path... The calculation of tx_state[0] is the current critical path:

t1 = rx_t1_part + e1_w + ch_w
tx_state[0] <= t1 + e0_w + maj_w;
Which is actually pretty good, since it's implemented as only two adders.
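To double-check Anoynomous's idea in software, here's a small Python model of the message schedule (a sketch of the idea, not the actual Verilog; the name `pre` is mine, standing in for rx_pre_w). It computes s0(w[t-15]) + w[t-16] one round ahead, leaving only a 3-input addition in the round itself, and verifies the result matches the textbook recurrence:

```python
MASK = 0xFFFFFFFF

def ror(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def s0(x): return ror(x, 7) ^ ror(x, 18) ^ (x >> 3)
def s1(x): return ror(x, 17) ^ ror(x, 19) ^ (x >> 10)

def schedule_baseline(w16):
    """Textbook recurrence: a 4-input addition every round."""
    w = list(w16)
    for t in range(16, 64):
        w.append((s1(w[t-2]) + w[t-7] + s0(w[t-15]) + w[t-16]) & MASK)
    return w

def schedule_precalc(w16):
    """Compute s0(w[t-15]) + w[t-16] one round ahead ('pre'),
    leaving a 3-input addition in the round itself."""
    w = list(w16)
    pre = (s0(w[1]) + w[0]) & MASK            # ready before round 16
    for t in range(16, 64):
        w.append((s1(w[t-2]) + w[t-7] + pre) & MASK)
        pre = (s0(w[t-14]) + w[t-15]) & MASK  # next round's pre-sum
    return w

assert schedule_baseline(range(16)) == schedule_precalc(range(16))
```

Both functions produce identical schedules, so the pre-calculation is algebraically safe.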
|
|
|
I am having a little trouble here. I had some experience designing a SHA-1 hash cracker on an FPGA, so this project caught my interest. When I downloaded the code and tried to compile it for the S6 LX150, it took about an hour just to synthesize the code, and then the software said I had overused my resources. So I wanted to know, where did I go wrong? Which project did you use? For the S6-LX150, this is probably the preferred project to start from: https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/tree/master/projects/LX150_makomk_Test You'll want to adjust main_pll.v:98 to 5 for 50MHz, to make the compile easier and the firmware actually usable (assuming you have the S6-LX150T dev board) without cooling.
|
|
|
My criticism of this design (your design?) is that there is too much pipelining. Thank you for the criticism. I really do appreciate the feedback, and I am by no means an expert. My intuition is similar to yours, in that a more traditional serial design should achieve better utilization and performance on the Spartan-6 architecture. But it is very easy to underestimate the massive amount of optimization that occurs in the fully unrolled design that takes my current primary focus. I have a functioning serial implementation, but so far my estimate of its total performance, once put in parallel on the S6-LX150, is not exciting. Something like 120MH/s. It's in the back of my mind, and there is plenty more work to be done in optimizing and perfecting it, but it hasn't shown me enough promise to warrant being in my mental spotlight like the unrolled design. The logic you are using to compute the basic hashes is not optimal, and you have not spent any time trying to optimize for your critical path. The current critical path is approximately two 3-way 32-bit adders implemented as 16 total slices, thanks to the Spartan-6 fast carry look-ahead chains. Is there a means of optimizing that logic that I have missed?
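To make that two-adder critical path concrete, here's a Python model of the regrouped round computation (a sketch of the idea, not the actual Verilog): rx_t1_part absorbs h + K[t] + w[t], registered the cycle before, so the round itself reduces to two 3-way additions.

```python
MASK = 0xFFFFFFFF

def ror(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def big_s0(a): return ror(a, 2) ^ ror(a, 13) ^ ror(a, 22)
def big_s1(e): return ror(e, 6) ^ ror(e, 11) ^ ror(e, 25)
def ch(e, f, g): return (e & f) ^ (~e & g)
def maj(a, b, c): return (a & b) ^ (a & c) ^ (b & c)

def new_a_textbook(a, b, c, e, f, g, h, k, w):
    t1 = (h + big_s1(e) + ch(e, f, g) + k + w) & MASK
    t2 = (big_s0(a) + maj(a, b, c)) & MASK
    return (t1 + t2) & MASK

def new_a_two_adders(a, b, c, e, f, g, rx_t1_part):
    """rx_t1_part = h + K[t] + w[t], registered on the previous cycle."""
    t1 = (rx_t1_part + big_s1(e) + ch(e, f, g)) & MASK  # 3-way adder #1
    return (t1 + big_s0(a) + maj(a, b, c)) & MASK       # 3-way adder #2

# Spot check that the regrouping is exact mod 2^32:
a, b, c, e, f, g, h = (0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0x510E527F,
                       0x9B05688C, 0x1F83D9AB, 0x5BE0CD19)
k, w = 0x428A2F98, 0x12345678
assert new_a_textbook(a, b, c, e, f, g, h, k, w) == \
       new_a_two_adders(a, b, c, e, f, g, (h + k + w) & MASK)
```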
|
|
|
200MH is simply way out of the question for an S6-LX150. That won't stop me from trying. As far as I can tell with the poking around I've done so far, the current bottleneck on the S6-LX150 is the far dependencies caused by the W calculations. These references make it so that the rounds are not isolated, and so cannot be routed into a uniform chain. This forces ISE to do completely absurd routing, splattering the placement of a round's components across a good 1/4th of the chip. And that, obviously, leads to massive routing delays. On my last few compiles, the worst-case paths were >80% routing (8ns+ of routing, with 2ns of logic). If W is buffered between each round as a 512-bit register, instead of chains of shift registers and BRAMs, then the rounds can be isolated, but ISE fails to Map such a design for reasons I have not yet nailed down. 512 bits × ~100 rounds is quite a lot of registers. If I, or someone else, can find a way to isolate the rounds and put them into a more consistent chain, then I highly suspect that both performance and area will improve considerably. I may create a "fake" design that focuses specifically on the W calculations (without digester rounds), and see if I can somehow get them routed into a sensible structure (even if it requires manual placement).
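A toy Python model of the "isolated rounds" idea (my own sketch, not the design that fails in Map): each round stage carries its own 16-word window of W, which in hardware would live in a 512-bit register between rounds, and the chained output still matches the flat recurrence with its far-reaching indices.

```python
MASK = 0xFFFFFFFF

def ror(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def s0(x): return ror(x, 7) ^ ror(x, 18) ^ (x >> 3)
def s1(x): return ror(x, 17) ^ ror(x, 19) ^ (x >> 10)

def round_step(window):
    """One round's W logic: consume a 16-word window (w[t-16]..w[t-1]) and
    emit the next window. In hardware each window would be a 512-bit
    register between rounds, keeping every round's inputs local."""
    new_w = (s1(window[14]) + window[9] + s0(window[1]) + window[0]) & MASK
    return window[1:] + [new_w]

w0 = [(i * 0x01020304) & MASK for i in range(16)]   # arbitrary demo words

# Chain of isolated rounds, each seeing only its own window:
window, out = list(w0), list(w0)
for _ in range(48):
    window = round_step(window)
    out.append(window[-1])

# Reference: the flat w[t] recurrence with its far dependencies.
ref = list(w0)
for t in range(16, 64):
    ref.append((s1(ref[t-2]) + ref[t-7] + s0(ref[t-15]) + ref[t-16]) & MASK)
assert out == ref
```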
|
|
|
Still, why not do the share/H6 test on the GPU - it would certainly be faster - shares are also rare compared to a job (about 1 in 2 billion). Is that an issue with the CL not being able to be changed based on the difficulty? There are several reasons. 99.99% of the time the mining software only needs to look for Difficulty 1 (a share, H7==0), so there is rarely a need to check for anything else. GPUs absolutely hate branching; a full Difficulty check involves many branches. Smaller GPU programs are better GPU programs. The CPU runs in parallel to the GPU. Since the CPU is fully capable of checking for extra Difficulty levels, why would you burden the GPU with such work? The CPU should double-check the GPU's results anyway, to detect errors. Since the CPU will thus be recomputing the full two SHA-256 passes for each result returned by the GPU, it again makes sense to only check for higher difficulties on the CPU.
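For reference, the cheap Difficulty-1 check the CPU side performs amounts to something like this Python sketch (function names are mine, and the real poclbm code differs; this just illustrates the H7==0 test):

```python
import hashlib

def block_hash(header80):
    """Double SHA-256 of an 80-byte block header."""
    return hashlib.sha256(hashlib.sha256(header80).digest()).digest()

def h7_is_zero(digest32):
    """Difficulty-1 'share' test: the hash's last 32-bit word (H7) is
    zero. For an ==0 comparison the word's byte order doesn't matter."""
    return digest32[28:32] == b"\x00\x00\x00\x00"

# The GPU only performs the cheap H7 check; anything rarer (higher
# difficulty targets) is left to the CPU, which re-hashes the candidate
# anyway to catch errors.
candidate = b"\x00" * 80        # placeholder header, not a real one
print(h7_is_zero(block_hash(candidate)))
```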
|
|
|
I've compiled a Win32 EXE for my poclbm fork (which has phatk, phatk2, phatk2.1, and phatk2.2 support): http://www.bitcoin-mining.com/poclbm-progranism-win32-20110814a.zip (md5sum: df623a45f8cb0a50fcded92728f12c14). Let me know if it works; I was only able to test it on one machine so far. Well, I've been talking to a few people about this but got no real response from anyone that it was possible... The optimization you've spelled out is more or less already implemented in most, if not all, GPU miners. The way GPU miners currently work is that they check in the GPU code whether h7==0. If it does, the result (a nonce) is returned; otherwise nothing is returned. It is the responsibility of the CPU software to do any further difficulty checks if needed. Since the only thing the GPU miners care about is H7, they completely skip the last 3 rounds (stopping after the 61st round). Also note that GPU miners don't calculate the first 3 rounds of the first pass. Those rounds are pre-computed, because the inputs to those rounds remain constant for a given unit of getwork. So a GPU miner really only computes a grand total of 122 rounds, minus various other small pre-calculations here and there.
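The first-three-rounds pre-computation can be sanity-checked in Python. The sketch below is my own model (with made-up work-unit words, not real getwork data): it implements the SHA-256 compression rounds and shows that the state after round 2 depends only on w[0..2], so it can be computed once per getwork while the nonce varies at w[3].

```python
MASK = 0xFFFFFFFF
IV = [0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19]
K = [0x428A2F98, 0x71374491, 0xB5C0FBCF, 0xE9B5DBA5, 0x3956C25B, 0x59F111F1,
     0x923F82A4, 0xAB1C5ED5, 0xD807AA98, 0x12835B01, 0x243185BE, 0x550C7DC3,
     0x72BE5D74, 0x80DEB1FE, 0x9BDC06A7, 0xC19BF174, 0xE49B69C1, 0xEFBE4786,
     0x0FC19DC6, 0x240CA1CC, 0x2DE92C6F, 0x4A7484AA, 0x5CB0A9DC, 0x76F988DA,
     0x983E5152, 0xA831C66D, 0xB00327C8, 0xBF597FC7, 0xC6E00BF3, 0xD5A79147,
     0x06CA6351, 0x14292967, 0x27B70A85, 0x2E1B2138, 0x4D2C6DFC, 0x53380D13,
     0x650A7354, 0x766A0ABB, 0x81C2C92E, 0x92722C85, 0xA2BFE8A1, 0xA81A664B,
     0xC24B8B70, 0xC76C51A3, 0xD192E819, 0xD6990624, 0xF40E3585, 0x106AA070,
     0x19A4C116, 0x1E376C08, 0x2748774C, 0x34B0BCB5, 0x391C0CB3, 0x4ED8AA4A,
     0x5B9CCA4F, 0x682E6FF3, 0x748F82EE, 0x78A5636F, 0x84C87814, 0x8CC70208,
     0x90BEFFFA, 0xA4506CEB, 0xBEF9A3F7, 0xC67178F2]

def ror(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def expand(w16):
    """SHA-256 message schedule."""
    w = list(w16)
    for t in range(16, 64):
        lo = ror(w[t-15], 7) ^ ror(w[t-15], 18) ^ (w[t-15] >> 3)
        hi = ror(w[t-2], 17) ^ ror(w[t-2], 19) ^ (w[t-2] >> 10)
        w.append((w[t-16] + lo + w[t-7] + hi) & MASK)
    return w

def run_rounds(v, w, start, stop):
    """SHA-256 compression rounds [start, stop) from state snapshot v."""
    a, b, c, d, e, f, g, h = v
    for t in range(start, stop):
        t1 = (h + (ror(e, 6) ^ ror(e, 11) ^ ror(e, 25))
              + ((e & f) ^ (~e & g)) + K[t] + w[t]) & MASK
        t2 = ((ror(a, 2) ^ ror(a, 13) ^ ror(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [a, b, c, d, e, f, g, h]

# Rounds 0-2 touch only w[0..2]; the nonce sits at w[3], so the state
# after round 2 can be computed once per getwork (w_head is made up).
w_head = [0x01234567, 0x89ABCDEF, 0x00C0FFEE]
after3 = run_rounds(IV, w_head + [0] * 13, 0, 3)   # w[3:] unused here

for nonce in (0, 1, 0xDEADBEEF):
    w = expand(w_head + [nonce] + [0] * 12)
    assert run_rounds(IV, w, 0, 64) == run_rounds(after3, w, 3, 64)
```

Skipping the last 3 rounds is the other half of the bookkeeping: only H7 is needed, so rounds 62-63 (and the round producing unused state) never run, giving 61 + 61 = 122 rounds across the two passes.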
|
|
|
Finally got around to coding some maximum clock speed improvements for users of smaller Cyclone III and IV devices - now available from my new partial-unroll-speed branch. Expected minimum device size and speed is roughly as follows: More fantastic work, makomk! *applause* I've been playing around with the xilinx-verilog port in the github repo and can confirm that it works just fine on the Xilinx Spartan-6 LX9 MicroBoard eval board from Avnet for $69. Thank you for taking the time to share your experiences with all these mini eval boards, jonand. That's great information. It's a shame that the LX9 MicroBoard uses a 324 landing. Would be neat to re-solder an LX150 onto it, but the LX150 doesn't come in a 324 package :/ But finally, my experiment is working @ 2250 Mhash/s and about 100W! The cost is out of control, but since I had these cards laying around from other experiments, I figured I'd give it a shot. *drools* For reference, an AMD 5850 only gets ~350MH/s for ~150W.
|
|
|
Not sure to what use they can be put either. I'm guessing you could pair them up to get the equivalent of a 16BWER, but I'm not sure. I just tried adding an extra register to the shifter's inferred RAM. After compilation it failed timing (80MHz) and ... the register was gone. I'm guessing it optimized the register away somehow, or balanced it. Either way, it ended up having a negative impact. I'm running another compile with USE_XILINX_BRAM_FOR_W off to see how that works. Perhaps we need to find a way for ISE not to optimize the shifter so much when USE_XILINX_BRAM_FOR_W is being used?
|
|
|
Regarding Performance Optimizing the Spartan-6 LX150: DANGER: long, detailed post coming... sorry, I hope the information is useful though. I'm working with my LX150_makomk_speed_Test project, where I'm trying to nail down the performance bottlenecks and remove them. I'm learning FPGA Editor so I can better visualize what the router is doing, and I've read through some of the Spartan-6 UGs to get a better understanding of the architecture. First off, I will say I am quite impressed by Xilinx's work and foresight on the logic of the S6's slices. They can perform a 3-component, 32-bit addition in 8 chained slices, with 4 bits being computed per slice. That blew my mind when I saw it in FPGA Editor. This is great for our mining algorithm, and you can see why in this critical path analysis:

W: 16 slices + 0 slices = 16
tx_t1_part: 8 slices + 0 slices = 8
t1: 8 slices + 0 slices = 8
tx_state[0]: 8 slices + 8 slices = 16
tx_state[4]: 8 slices + 8 slices = 16
The worst critical paths are only 16 slices long, with a single break in the carry chain (AFAIK). W is a 4-way, performed as a 3-way of the first 3 components and a 2-way of the result and the remaining component. tx_state[4] is a 2-way with t1 and rx_state[3]. I haven't fully analyzed the router's behavior on the 2-ways yet, but it appears to include work from other operations... somehow. Not sure yet. So, that's the good news. The bad news is, of course, only half of the slices are useful. There are two slices in a CLB. One slice always has fast-carry logic and chains to the slice directly above it (in the CLB above it). The other slice is a lowest-form-of-life slice. It's still a powerful slice, with 4 6-LUTs (or 8 5-LUTs, or combinations thereof) and 8 flip-flops, but the mining algorithm has rare use for it. The next bad news is, only half of the "good" slices can be used as RAM or shift registers. That's not a terrible thing, since most will be consumed as adders anyway. And that's about all I could find that's particularly good or bad with the S6 slices. Since the good slices are all in columns, and spaced evenly, the impact of the useless slices should actually be far less severe than I thought. For the S6's routing architecture, the quick overview basically said routing costs roughly the Manhattan distance between CLBs. I haven't dug into the details more than that at this point. With that knowledge in hand, and some beginner's experience with FPGA Editor, I dived in and found what appears to be the largest bottleneck in the current code:

Slack (setup path): 0.264ns (requirement - (data path - clock path skew + uncertainty))
  Source: uut2/HASHERS[41].shift_w1/Mram_m (RAM)
  Destination: uut2/HASHERS[41].upd_w/tx_w15_30 (FF)
  Requirement: 12.500ns
  Data Path Delay: 11.716ns (Levels of Logic = 6)
  Clock Path Skew: -0.260ns (0.780 - 1.040)
  Source Clock: hash_clk rising at 0.000ns
  Destination Clock: hash_clk rising at 12.500ns
  Clock Uncertainty: 0.260ns

Clock Uncertainty: 0.260ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
  Total System Jitter (TSJ): 0.070ns
  Total Input Jitter (TIJ): 0.000ns
  Discrete Jitter (DJ): 0.450ns
  Phase Error (PE): 0.000ns
Maximum Data Path at Slow Process Corner: uut2/HASHERS[41].shift_w1/Mram_m to uut2/HASHERS[41].upd_w/tx_w15_30

Location             Delay type      Delay(ns)  Physical Resource / Logical Resource(s)
---------------------------------------------------------------------------------------
RAMB16_X2Y46.DOA2    Trcko_DOA       1.850      uut2/HASHERS[41].shift_w1/Mram_m
                                                uut2/HASHERS[41].shift_w1/Mram_m
SLICE_X60Y126.A2     net (fanout=4)  5.845      uut2/HASHERS[41].cur_w1<2>
SLICE_X60Y126.COUT   Topcya          0.379      uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_cy<19>
                                                uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_lut<16>
                                                uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_cy<19>
SLICE_X60Y127.CIN    net (fanout=1)  0.003      uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_cy<19>
SLICE_X60Y127.BMUX   Tcinb           0.292      uut2/HASHERS[41].shift_w0/r<27>
                                                uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd1_cy<23>
SLICE_X78Y122.B3     net (fanout=1)  1.995      uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd_211
SLICE_X78Y122.BMUX   Tilo            0.251      uut2/HASHERS[41].upd_w/tx_w15<23>
                                                uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd221
SLICE_X78Y122.C5     net (fanout=2)  0.383      uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd221
SLICE_X78Y122.COUT   Topcyc          0.295      uut2/HASHERS[41].upd_w/tx_w15<23>
                                                uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_lut<0>22
                                                uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_cy<0>_22
SLICE_X78Y123.CIN    net (fanout=1)  0.003      uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_cy<0>23
SLICE_X78Y123.COUT   Tbyp            0.076      uut2/HASHERS[41].upd_w/tx_w15<27>
                                                uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_cy<0>_26
SLICE_X78Y124.CIN    net (fanout=1)  0.003      uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_cy<0>27
SLICE_X78Y124.CLK    Tcinck          0.341      uut2/HASHERS[41].upd_w/tx_w15<31>
                                                uut2/HASHERS[41].upd_w/ADDERTREE_INTERNAL_Madd2_xor<0>_30
                                                uut2/HASHERS[41].upd_w/tx_w15_30
---------------------------------------------------------------------------------------
Total                                11.716ns   (3.484ns logic, 8.232ns route)
                                                (29.7% logic, 70.3% route)
It's being forced to route from a RAMB16BWER to a CLB that's right smack dab in the middle of a group of columns, the furthest possible position from possible RAM locations. Here, check this image out; it will make you go insane, so don't stare too long: https://i.imgur.com/gBv5R.png (RAM is on the left). No, seriously, don't stare at it. The router will drive insanity into the depths of your soon-rotting brain fleshes. After exploding the Universe, of course. Oh, you looked at it anyway and are wondering about that little path heading downward? Yeah, it keeps going... and going... (into my damned soul). And as you can read from the timing report above, routing accounts for *drum roll* 70.3%! Yay! That's 8ns of routing, and only 3.4ns of logic! Imagine if we got rid of all the routing... I see four solutions at the moment, and will investigate as time allows: 1) Get rid of the RAMB16BWER to some extent. 2) Add an extra register to the output of shifter_32b when inferring RAM logic. Flip-flops should route close to the logic and mask the RAM routing delay. 3) Add two duplicate registers to the output of shifter_32b when inferring RAM logic. 4) Ditch the RAM inference completely and try to coax ISE into using all those flip-flops in the useless slices (which are peppered throughout the routed design at the moment). I will try 3 first, and hope ISE does the intelligent thing. My hope is that flip-flops in the useless slices will get utilized, since they're mingled in with the useful logic and so should provide somewhat fast local routing. The interesting thing is that we've got lots of RAM to play with. The design is using ~30% of the 16BWERs, and none of the 8BWERs. It seems like a good idea to try to use them and bring slice consumption down if possible, but only if their awkward placement can be solved appropriately.
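As an aside, the "routing cost roughly equals Manhattan distance between CLBs" rule of thumb can be applied to the site names in that timing report. This is a toy calculation of my own; real Spartan-6 delays depend on the actual wire segments the router picks, not just tile distance.

```python
# Toy application of the "routing cost ~ Manhattan distance" rule of thumb
# to ISE site names (illustrative only; real delays vary by wire segment).

def slice_xy(site):
    """Parse (x, y) from an ISE site name like 'SLICE_X60Y126'."""
    x, y = site.split("_X")[1].split("Y")
    return int(x), int(y)

def manhattan(a, b):
    (ax, ay), (bx, by) = slice_xy(a), slice_xy(b)
    return abs(ax - bx) + abs(ay - by)

# The worst hop in the report: after the 5.845ns fanout=4 net lands at
# SLICE_X60Y126, the path later jumps across to SLICE_X78Y122.
print(manhattan("SLICE_X60Y126", "SLICE_X78Y122"))  # -> 22 tiles
```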
|
|
|
If yes, we should probably decide on a common simple serial protocol for all these FPGA designs, so that one piece of software will fit them all. I'll happily implement that in my Python miner, so you don't need to care about the software side of things. I completely agree, and Python is a far better choice for controller software than Tcl. I took the time to write out some preliminary specifications for the internal hardware interfaces: https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/wiki/Specification:-Components,-Interfaces,-and-Protocols-%5BWIP%5D TL;DR: I abstracted away all of the hardware and implementation specifics and shielded them with a single, memory-mapped component called the Comm. Being memory mapped, it's a trivial interface that protocols like I2C and SPI can wrap easily. Relevant to the topic at hand, the specifics of this wrapping by SPI, I2C, UART, JTAG, or what have you will be documented here: https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner/wiki/Specification:-PHY-Interfaces And yes, I've abused the term PHY in that document and the previous specification, because I'm a horrible person and I want to watch language burn for my own personal pleasure. Currently in that PHY specification I've only outlined the requirements. TL;DR: it must support multiple devices (addressing a single device specifically), allow reading registers, and allow writing registers. That's it. I haven't laid down any groundwork for any specific protocols yet, but I listed out what the most immediate ones probably are: UART, SPI, and JTAG. Feel free to begin specifying those in this thread. And as always, feel free to critique my work, call me stupid, and redo the whole thing.
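To make the discussion concrete, here's one possible byte-level framing for addressed register reads and writes, sketched in Python. This is entirely hypothetical on my part (the spec is still WIP and only defines requirements); the field layout and op codes are invented for illustration.

```python
import struct

# Hypothetical framing for the memory-mapped Comm interface -- NOT the
# actual spec. One frame: device address, op code (0 = read, 1 = write),
# register number, 32-bit data word.
FRAME = struct.Struct(">BBBI")   # big-endian, 7 bytes total

def encode(device, op, reg, data=0):
    return FRAME.pack(device, op, reg, data)

def decode(frame):
    return FRAME.unpack(frame)

# Write 0xDEADBEEF to register 0x10 on device 2, then round-trip it:
pkt = encode(0x02, 1, 0x10, 0xDEADBEEF)
assert decode(pkt) == (0x02, 1, 0x10, 0xDEADBEEF)
```

A fixed-size frame like this is trivial to carry over UART, SPI, or I2C, since the receiver always knows how many bytes to expect.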
|
|
|
If I take x number of some FPGA chips, and assume a bunch of other things, here's how to do a rough estimate of the expected hashing rate... That's what I'm after. I can't find that post... It's easiest to do with Altera Cyclone chips, and roughly works out to: 1,000 LEs = 1 MH/s. Could be more in some cases, less in others, but that's a fairly good baseline. For Spartan-6 it's roughly: 1,000 LEs = 0.5 MH/s. For high-performance families like Virtex or Stratix you should multiply by 3 or 4.
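Those rules of thumb are easy to wrap in a quick Python helper. The numbers below are just the rough baselines quoted above, nothing more; real results vary by design and toolchain.

```python
# Rough rule-of-thumb hashrate estimator -- baselines only.
MHS_PER_1K_LES = {
    "cyclone": 1.0,    # Altera Cyclone: 1,000 LEs ~= 1 MH/s
    "spartan6": 0.5,   # Xilinx Spartan-6: 1,000 LEs ~= 0.5 MH/s
}

def estimate_mhs(les, family, high_perf_factor=1.0):
    """high_perf_factor ~= 3-4 for Virtex/Stratix-class parts."""
    return les / 1000.0 * MHS_PER_1K_LES[family] * high_perf_factor

print(estimate_mhs(150_000, "spartan6"))  # an LX150-class part: ~75 MH/s
```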
|
|
|
I'm hoping someone here can help me decipher this mystery. poclbm displays two hashing rates when it is running: the real hashing rate, and the estimated hashing rate. The first rate is calculated directly from the number of hashes the GPU has processed, so it should be the most accurate number. The second number is the estimated hashing rate, and it's calculated from the number of shares poclbm actually submits within a window of time (15 minutes by default). It does not directly reflect the GPU's actual processing speed, but rather it's a good estimate of the "all things considered" hashing rate. I made a small modification to the code for that second value:

if self.options.estimate == -1:
    total_shares = self.share_count[1] + self.share_count[0]
    estimated_rate = Decimal(total_shares) * (work.targetQ) / int(now - start_time) / 1000
else:

If I use the command line option "-e -1" it will now estimate the hashing rate over the entire run-time of poclbm, counting all submitted shares, even rejected ones. I wanted this so I could be sure that experimental kernels were really producing the expected hashrate. For example, if the kernel or I screwed up the nonce calculations, it could end up re-hashing the same nonce, and hence reduce real performance. The Problem: In the above screenshot, I've run with "-e -1" for a hair under a day now... and the estimated rate has remained at approximately 363MH/s since last night. That's 9 MH/s (2.5%) more than what should be the most accurate number. Can anyone figure out what might account for this discrepancy? As far as I can tell, share_count should be accurate; no share should be counted twice. All the code is here: https://github.com/progranism/poclbm And it should be up to date with m0mchil's code except for my modifications.
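For anyone following along, the estimate boils down to this relationship. This is a simplified restatement, and `target_q` stands in for poclbm's work.targetQ, which I'm assuming is on the order of 2^32 at Difficulty 1.

```python
from decimal import Decimal

# At Difficulty 1 a share is expected about once per 2^32 hashes, so
# shares seen over elapsed time imply a hashrate. Dividing by 1000 puts
# it in khash/s, matching poclbm's units. 'target_q' stands in for
# work.targetQ (an assumption about its magnitude on my part).

def estimated_rate_khs(total_shares, seconds, target_q=2**32):
    return Decimal(total_shares) * target_q / int(seconds) / 1000

# One share in 2^20 seconds implies 2^32 / 2^20 = 4096 hashes/s,
# i.e. 4.096 khash/s:
print(estimated_rate_khs(1, 2**20))  # -> 4.096
```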
|
|
|
Add in 8 values: 1 vs. naive 8 (saves 7). Theoretically you don't need that last add either:

if(h64 + state[7] == 0) // Yay! Money!

optimized to ->

if(h64 == -state[7]) // Yay! Money!

Where state[7] is constant. So you save 8 ops instead of just 7. I don't know if this is applicable to GPUs specifically (perhaps it's cheaper to add than to compare to 0!?), but it's an optimization in general.
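The algebra checks out in 32-bit modular arithmetic, as this quick Python verification shows (the state[7] value here is arbitrary; -state[7] is its two's-complement negation, precomputed once per work unit):

```python
MASK = 0xFFFFFFFF

def is_hit_naive(h64, s7):
    """Add, then compare to zero (the extra add)."""
    return (h64 + s7) & MASK == 0

def is_hit_optimized(h64, s7_neg):
    """Compare against the precomputed negation -- no add needed."""
    return h64 == s7_neg

s7 = 0x5BE0CD19                 # state[7] is constant for a work unit
s7_neg = (-s7) & MASK           # two's-complement negation, precomputed

for h64 in (0, 1, s7_neg, MASK, (s7_neg + 1) & MASK):
    assert is_hit_naive(h64, s7) == is_hit_optimized(h64, s7_neg)
assert is_hit_optimized(s7_neg, s7_neg)   # the one winning value
```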
|
|
|
I had a class about how CPUs work internally last semester and I think this might be interesting. It'd probably be prohibitively expensive, but how about a fully pipelined SHA-256? https://bitcointalk.org/index.php?topic=9047.0 Is a CPU, FPGA, GPU simultaneous mining rig possible, say on a 6-core i7 or an AMD Opteron? Sure is. I've run GPU + FPGA before. CPU is worthless since I pay for my electricity.
|
|
|
|