I have an old Altera DE2-70 board (Cyclone II) that I picked up for $300 (academic price) 1.5 years ago. It looks like the current model is the DE2-115, based on the Cyclone IV FPGA.
I just did a bit of research on existing SHA-256 implementations for FPGAs, and I see that several companies sell high performance FPGA implementations (e.g.
http://www.cast-inc.com/ip-cores/encryption/sha-256/index.html). Taking the Cast implementation as an example:
"The processing of one 512-bit block is performed in 66 clock cycles and the bit-rate achieved is 7.75Mbps / MHz on the input of the SHA256 core."
Taking a clock rate of 132MHz as a reasonably conservative number for my older Cyclone II (Cast claims up to 280MHz on high-performance FPGAs), this comes out to 132MHz / 66 cycles = 2Mhps per core. Cast's implementation uses around 2,531 LEs on the Cyclone. My older DE2-70 board contains about 68,000 LEs.
Adding 10% overhead for communication/synchronization/etc, it should be possible to put 24 SHA-256 processors on my DE2-70. That should allow up to 48Mhps peak processing rate (>80Mhps for the DE2-115 which can also be clocked faster).
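For anyone who wants to redo the estimate with their own board, here is the same back-of-the-envelope calculation as a small Python sketch. The numbers are the ones quoted above; the function names are just mine, and I am treating the 10% overhead as LEs set aside before packing cores, which is an assumption.

```python
def core_hash_rate(clock_hz, cycles_per_block):
    """Blocks hashed per second by one core (one 512-bit block per hash)."""
    return clock_hz / cycles_per_block

def cores_that_fit(total_les, les_per_core, overhead_fraction):
    """Whole cores that fit after reserving a fraction of the LEs for glue logic."""
    usable_les = total_les * (1.0 - overhead_fraction)
    return int(usable_les // les_per_core)

CLOCK_HZ = 132e6          # conservative Cyclone II clock from the post
CYCLES_PER_BLOCK = 66     # from the Cast datasheet quote
LES_PER_CORE = 2531       # Cast's figure for the Cyclone
DE2_70_LES = 68000        # approximate LE count of the DE2-70
OVERHEAD = 0.10           # communication/synchronization/etc.

per_core = core_hash_rate(CLOCK_HZ, CYCLES_PER_BLOCK)
n_cores = cores_that_fit(DE2_70_LES, LES_PER_CORE, OVERHEAD)
peak = per_core * n_cores

print(f"{per_core / 1e6:.1f} Mhps per core, {n_cores} cores, {peak / 1e6:.0f} Mhps peak")
```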
Another question: How much communications bandwidth is needed at these speeds, and can it fit on a 100baseT channel? Certainly not if we want the host to transfer all of the candidates to be hashed onto the FPGA (48M * 512 bits = 24.6Gbps -- well above even gigabit ethernet speeds). Is there another approach that can overcome this limitation? I think so...
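A quick sketch of that bandwidth check, in case the link or candidate size changes. It computes the raw payload rate for a full 512-bit block per hash and, for comparison, for a 256-bit candidate; framing and protocol overhead are ignored, and the names are my own.

```python
def required_gbps(hashes_per_second, bits_per_candidate):
    """Raw payload bandwidth needed to ship every candidate from the host."""
    return hashes_per_second * bits_per_candidate / 1e9

PEAK_HASH_RATE = 48e6  # peak rate estimated for the DE2-70
LINKS_GBPS = {"100baseT": 0.1, "gigabit ethernet": 1.0}

for bits in (512, 256):
    need = required_gbps(PEAK_HASH_RATE, bits)
    for name, capacity in LINKS_GBPS.items():
        verdict = "fits" if need <= capacity else "does not fit"
        print(f"{bits}-bit candidates: {need:.1f} Gbps needed, {verdict} on {name}")
```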
FPGAs have room for a dedicated CPU as well as a lot of logic, depending on what level of functionality you need in the CPU. There are a lot of free and powerful CPU cores available on opencores.org, but it will be hard to beat the Nios II architecture if you are using Altera FPGAs.
A 32-bit Nios II/f CPU core is capable of 140 MIPS of performance (at 125MHz) and uses 1600 LEs on the Cyclone II. Is this sufficient to keep 24 high-speed SHA-256 blocks from stalling? Not even close. In fact, it would probably not be able to keep even one SHA-256 block from stalling. Back to the drawing board...
It looks like a better approach would be to implement the search logic directly in gates on the FPGA, and have it fill one or more 256-bit-wide queue(s) which would be drawn on by the SHA-256 processing blocks. A single Nios II CPU still makes sense for collecting the results and sending them back to the host CPU (TCP/IP stack), as well as for loading the starting and ending values into the search logic.
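To make the division of labour concrete, here is a rough software model of that idea in Python. The host hands over only a fixed prefix plus start and end values, the "device" generates every candidate itself, and only hits travel back. Everything here is illustrative: `search`, the 32-bit little-endian counter, the single SHA-256 pass and the big-endian target comparison are my assumptions, not a description of any real core or of the exact Bitcoin format.

```python
import hashlib
import struct

def search(prefix: bytes, start: int, end: int, target: int):
    """Model of on-chip search logic: enumerate counters locally, yield only hits."""
    for counter in range(start, end):
        # Candidate is built on the device, so it never crosses the host link.
        candidate = prefix + struct.pack("<I", counter)
        digest = hashlib.sha256(candidate).digest()
        if int.from_bytes(digest, "big") < target:
            yield counter, digest

# Host side: one small request in, a handful of results out.
prefix = b"example work unit"      # hypothetical fixed part of the input
target = 1 << 244                  # roughly 1 hit per 4096 candidates
hits = list(search(prefix, 0, 20000, target))
print(f"{len(hits)} hits out of 20000 candidates sent back to the host")
```

With this split the host link only has to carry the work description and the rare hits, which is why a slow soft CPU is enough for that side.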
Anyway, my back-of-the-envelope calculations seem to confirm almost everything ArtForz is saying below. It looks like the ATI 5970s are the right choice if your goal is to crunch bitcoins.
OTOH, if you want an excuse to learn how to program FPGAs, you will certainly be able to run circles around a state-of-the-art hex-core i7 CPU with a pretty modest FPGA -- but at considerable effort.
Jason
The real issue on FPGA isn't the logic ops (cheap) or the rotates (pretty much free), but the 32-bit adds.
A_out = H + s0 + s1 + maj + ch + K + W
-> at least 3 level adder tree ((H + s0) + (s1 + maj)) + ((ch + K) + W)
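For reference, one SHA-256 round written out in Python with the adds grouped the same way as the tree above. s0, s1, maj and ch are the standard SHA-256 round functions; the function name and the explicit level-by-level variables are just for illustration, and a real design would of course be HDL, not Python.

```python
MASK = 0xFFFFFFFF  # all arithmetic is modulo 2**32

def rotr(x, n):
    """32-bit rotate right (free in hardware: just wiring)."""
    return ((x >> n) | (x << (32 - n))) & MASK

def sha256_round(state, k, w):
    """One SHA-256 round; A_out is built with a 3-level adder tree."""
    a, b, c, d, e, f, g, h = state
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    maj = (a & b) ^ (a & c) ^ (b & c)
    ch = (e & f) ^ (~e & MASK & g)

    # Level 1: three independent 32-bit adds
    h_s0 = (h + s0) & MASK
    s1_maj = (s1 + maj) & MASK
    ch_k = (ch + k) & MASK
    # Level 2: two adds
    left = (h_s0 + s1_maj) & MASK
    right = (ch_k + w) & MASK
    # Level 3: final add
    a_out = (left + right) & MASK

    # E_out needs its own chain of adds: D + H + s1 + ch + K + W
    e_out = (d + h + s1 + ch + k + w) & MASK
    return (a_out, a, b, c, e_out, e, f, g)

# Round 0 of the standard "abc" test message: initial hash values, K[0], W[0]
IV = (0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
      0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19)
after_round0 = sha256_round(IV, 0x428a2f98, 0x61626380)
print(" ".join(f"{x:08x}" for x in after_round0))
```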
Carry chain delay in a single 32-bit adder on a -3 speed grade Spartan6 is ~2ns, so 3 adder levels are ~6ns, and without ANY routing delays we're already limited to 166MHz.
Real-world you're lucky to get 80MHz out of a non-pipelined round on a -3 S6
Pipelining a round to 2 or 3 stages helps, but increases FF usage a LOT (you have to carry 256 bits of A..H, 512 bits of W[0..15] and the initial A..H for the final add around).
2-stage gives ~140MHz on a -3, 3-stage ~180MHz
= a 2-stage pipelined sha256 round is ~1k FFs, 3-stage pipelined ~1.5k FFs
XC6SLX150 has something like 160k FFs available, and the synthesis tools pretty much throw speed out the window once you go >70% FF utilization.
so realistically you MIGHT be able to fit 64 2-stage pipelined rounds of sha256 on an LX150, 2 clocks/bitcoinhash @ 140MHz -> 70Mh/s
or maybe with lots of luck and sacrificing a chicken to the place and route gods 48 rounds 3-stage @ 180MHz -> 68Mh/s
= 70Mh/s on a -3 speed grade XC6SLX150, 20%-30% less on a -2 speed grade.
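The throughput numbers above can be reproduced with a few lines of Python. This assumes 128 SHA-256 rounds per bitcoin hash (two 64-round passes), which is what "64 rounds, 2 clocks/hash" implies; the function names are made up for the sketch.

```python
ROUNDS_PER_HASH = 128  # assumption: two 64-round SHA-256 passes per bitcoin hash

def hash_rate_mhs(rounds_on_chip, clock_mhz):
    """Mh/s for a design with this many round units, each finishing one round per clock."""
    clocks_per_hash = ROUNDS_PER_HASH / rounds_on_chip
    return clock_mhz / clocks_per_hash

two_stage = hash_rate_mhs(64, 140)    # 64 rounds, 2-stage pipelined
three_stage = hash_rate_mhs(48, 180)  # 48 rounds, 3-stage pipelined
print(f"2-stage: {two_stage:.1f} Mh/s, 3-stage: {three_stage:.1f} Mh/s")

# -2 speed grade: 20%-30% slower than the -3 figure
slow_lo, slow_hi = two_stage * 0.7, two_stage * 0.8
print(f"-2 speed grade: {slow_lo:.0f} to {slow_hi:.0f} Mh/s")
```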
so 9 grand for MAYBE 850Mh/s... a $500 HD5970 can get >550Mh/s stock, well >600Mh/s OCed at stock voltage even on a "bad" card.
okay, let's be REALLY generous, assume we can magically get 1.2Gh/s out of 12 150-2s and they consume NO POWER AT ALL.
So how long does it take at 600W for 2 5970s and $0.10/kWh to make up that $8k price difference? 0.6kW @ $0.10/kWh = $1.44/day, and $8k / $1.44/day ... about 15 years.
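Same payback arithmetic as a Python sketch, so the power price or price difference can be swapped out. It keeps the generous assumption from above that the FPGAs draw no power at all; the function names are illustrative.

```python
def daily_power_cost(kilowatts, price_per_kwh):
    """Electricity cost per day for a constant load."""
    return kilowatts * 24 * price_per_kwh

def payback_years(price_difference, kilowatts, price_per_kwh):
    """Years of GPU power bills needed to equal the extra hardware cost."""
    days = price_difference / daily_power_cost(kilowatts, price_per_kwh)
    return days / 365

cost_per_day = daily_power_cost(0.6, 0.10)   # 2x HD5970 at ~600W total
years = payback_years(8000, 0.6, 0.10)       # $8k price difference
print(f"${cost_per_day:.2f}/day, payback in about {years:.1f} years")
```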