I've based some of my research on this topic on previous work by ngzhang,
who identified the chip: https://bitcointalk.org/index.php?topic=79825.0
Unfortunately, I do not know the exact speed grade of the devices used.
I ran my "prototype" code in Quartus, the same code I used for the HardCopy IV
evaluation and for other Stratix and Cyclone V devices (as a reminder, on Cyclone V
it is possible to get about 320 Mh/s per chip @ 160 MHz @ approx. 6 W).
On EP3SL150F780C4 this code reached a highest clock of 220 MHz, and on
EP3SL150F780C3 it reached 250 MHz. It is just the unrolled round calculation.
The clock is based on the "Slow 1100mV 85C Model Fmax Summary", so with some
overvolting it would run a bit (probably about 10%) faster.
Fitter status: Successful - Sun Jul 01 10:22:07 2012
Quartus II 64-Bit Version: 11.1 Build 173 11/01/2011 SJ Full Version
Revision Name: ALAdder
Top-level Entity Name: sha_s4_test
Family: Stratix III
Logic utilization: 86%
Combinational ALUTs: 86,417 / 113,600 (76%)
Memory ALUTs: 0 / 56,800 (0%)
Dedicated logic registers: 85,360 / 113,600 (75%)
Total registers: 85360
Total pins: 7/488 (1%)
Total block memory bits: 198,080 / 5,630,976 (4%)
As you can see, one of the improvements for a Stratix / Cyclone design is to use RAMs. I instantiate them with the altsyncram primitive, as
it lets me implement shift registers with read_during_write_mode_mixed_ports => "DONT_CARE". This mode
is important, because otherwise the memory will be slower (consider this a HINT to BFL: do not use altshift_taps or automated synthesis of shift registers).
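To illustrate the RAM-based shift register idea, here is a minimal behavioural sketch in Python. It is not the altsyncram primitive and not the code from sha_s4_test.vhd; the class name, the `delay` parameter and the one-spare-slot addressing scheme are assumptions made for this example. It shows one way the read and write ports can be kept on different addresses every clock, in which case the result of a same-address read-during-write would never be observed. Real RAM read latency is ignored.

```python
class RamShiftRegister:
    """Behavioural model of a delay line built from a simple dual-port RAM.

    Hypothetical illustration only; this is not the altsyncram primitive.
    """

    def __init__(self, delay: int, fill: int = 0) -> None:
        if delay < 1:
            raise ValueError("delay must be at least 1 clock")
        self.depth = delay + 1          # one spare slot keeps the two ports apart
        self.mem = [fill] * self.depth  # RAM contents
        self.wr_addr = 0                # write-port address counter

    def clock(self, data_in: int) -> int:
        """Advance one clock: read the oldest word, then store the new one."""
        rd_addr = (self.wr_addr + 1) % self.depth
        # The ports never address the same word in the same clock, so the
        # RAM's same-address read-during-write result is never observed.
        assert rd_addr != self.wr_addr
        data_out = self.mem[rd_addr]
        self.mem[self.wr_addr] = data_in
        self.wr_addr = rd_addr
        return data_out


sr = RamShiftRegister(delay=3)
outputs = [sr.clock(x) for x in range(1, 7)]
print(outputs)  # [0, 0, 0, 1, 2, 3]: each input reappears 3 clocks later
```

In hardware the address counter costs a few registers, while the data itself sits in block memory bits instead of logic registers.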
PowerPlay estimates about 26214.26 mW at an average toggle rate of 249.704 million transitions / sec (for the 250 MHz clock setup).
So what does this mean for a BFL single device:
1) Without any overdriving, a device with C4 could deliver 220*4 = 880 Mh/s and with C3 250*4 = 1000 Mh/s;
2) Power consumption would be (let's assume that PowerPlay lied and it is 30 W @ 250 MHz @ 1.1 V) ~50 W for the C4 chip and ~65 W for the C3 chip;
3) With the chip overdriven to 1.2 V, the C4 chip would deliver about 960 Mh/s and the C3 chip about 1090 Mh/s; power consumption would be about 60 W and 76 W respectively.
But since they already have about 80 W power consumption, I conclude that the C3 chip is used, but that the top-level logic and round math are inferior and suboptimal. Basically, you can get _lower_ performance just by doing the operations in the wrong order.
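The throughput figures in points 1) and 3) can be checked with a few lines of Python. This is only a sketch of the arithmetic: the multiplier of 4 hashes per clock per BFL single is taken from the "*4" in the post, and the 9% overdrive gain is an assumption chosen because it roughly reproduces the quoted 960 and 1090 Mh/s (the post itself only says "probably like 10%"). The power figures are not modelled here.

```python
# Hashes produced per clock by one BFL single (the "*4" in the post).
HASHES_PER_CLOCK_PER_SINGLE = 4

# Assumed clock gain from running at 1.2 V instead of 1.1 V (not from the post).
ASSUMED_OVERDRIVE_GAIN = 0.09


def single_mhs(fmax_mhz: float, overdrive_gain: float = 0.0) -> float:
    """Estimated Mh/s of one BFL single at the given per-chip Fmax in MHz."""
    return fmax_mhz * HASHES_PER_CLOCK_PER_SINGLE * (1.0 + overdrive_gain)


for grade, fmax in (("C4", 220), ("C3", 250)):
    stock = single_mhs(fmax)
    overdriven = single_mhs(fmax, ASSUMED_OVERDRIVE_GAIN)
    print(f"{grade}: {stock:.0f} Mh/s stock, about {overdriven:.0f} Mh/s at 1.2 V")
```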
I've already tried to contact BFL by PM regarding my development and future ASIC deployments, but got no answer. Maybe this topic will add some heat.
But I have a few questions here:
1. What exactly is the chip speed grade?
2. What voltage is used (this can probably be measured by many owners of BFL singles)? Is it the standard 1.1 V or something like 1.2 V?
It would be nice if BFL made a full disclosure here about their previous products, as their ASIC initiative seems to make them obsolete already.
Also, to BFL: if you use the same top level for your ASIC development, don't you think you may end up with a product that is "obsolete-on-arrival"? This is not "custom IC cell design", this is just math, and not the complex part of it: this "test sha_s4_test.vhd" is actually a pretty small file, mostly RTL-style code that does not use low-level primitives etc. I ran the fitter and synthesis without optimization settings. But if you can't deliver the best in top-level optimization, why would I believe that you would at the low level, where things are more complex and you'd likely have to do full-wave simulations of your custom cells and still go through several re-spins? Or in layout, because layout done by automatic tools can destroy all the harvested performance.
(To DiabloD3 and those who think that custom IC is always that difficult: NO. I've studied this more. If you don't try to harvest performance and just make a reliable cell, it is _likely_ that your cell will work. You may even implement a "portable" cell that works at different fabs, but it would be quite inefficient. Custom IC may actually be even cheaper, because basically it is just the same as a PCB, but on silicon: if you design so that you accept a wide tolerance in your transistors, you get good manufacturability and good portability but poor performance. Surprisingly, tools for custom IC design without extensive modelling are actually cheaper. Take for example www.tannereda.com, a pretty nice tool to go from schematic to chip layout: you can even get an evaluation version there for free and try to lay out several transistors yourself, testing their performance in SPICE. That would, however, likely be far from the specs you get from silicon.)
I would be sorry if you have worked on the ASIC for quite a long period and already have a layout, because it seems that you'll have to redo it if my estimate of your top level is right.
PS. To BFL fans: please do not turn BFL into a religion :-) There's a mining speculation subforum for exactly that purpose.
PPS. As you can see, _only_ 75% of the chip is used... In the extra space, serial hashers (approx. 2 times bigger) could be fitted as an addition, since a design using automated placement would fit into 90% of the chip, leaving about 10%. In these 10% it is possible to place about 8 serial hashers running at the same or a faster (LIKELY FASTER) clock. Each hasher would output an additional 3.5 Mh/s, so +25-28 Mh/s per chip and +50-56 Mh/s per BFL single. That sets the best theoretical output at 1140 Mh/s per BFL single.
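The arithmetic in this PPS can be sketched as follows. The figure of 2 chips per BFL single is inferred from the post doubling the per-chip gain to get the per-single gain; the function and parameter names are made up for this example. The 1140 Mh/s total corresponds to the overdriven C3 estimate of about 1090 Mh/s plus the lower bound of +50 Mh/s.

```python
def extra_mhs(hashers_per_chip: int = 8,
              mhs_per_hasher: float = 3.5,
              chips_per_single: int = 2) -> tuple[float, float]:
    """Return (extra Mh/s per chip, extra Mh/s per BFL single)."""
    per_chip = hashers_per_chip * mhs_per_hasher
    return per_chip, per_chip * chips_per_single


per_chip, per_single = extra_mhs()
print(f"+{per_chip} Mh/s per chip, +{per_single} Mh/s per single")

# Best theoretical output quoted in the post: overdriven C3 estimate
# (about 1090 Mh/s) plus the conservative +50 Mh/s from serial hashers.
best_case_mhs = 1090 + 50
print(f"best case: {best_case_mhs} Mh/s per single")
```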