alh
Legendary
Offline
Activity: 1849
Merit: 1052
|
|
January 08, 2018, 11:26:38 PM |
|
That's the whole point of doing the transistor design as open hardware. You eliminate the biggest barrier to entry by putting the transistor layout into the public domain. There still would only be a handful of folks that would go to masks at 10 nm or lower nodes, but they would be forced to keep pricing competitive because there are dozens of entities capable of entering the market.
I am sure you could find faculty who would consider this a worthwhile project, and you could easily fund a few spins at an 8-inch 65 nm fab for under $1M. That's 50 BTC. I paid that much as a bounty to fix my FPGA supplier's garbage code back in 2012!
I am not a semiconductor guy, but just my discussions with folks that are suggest that the "design rules" and associated tool chains change on a regular basis as the node size shrinks. What that means to me is that the "rules and tools" for a 40nm process don't work for a 28nm process which don't work at 16nm. I think that means that your idea of developing with "cheap" process and then shrinking down won't work since the Fab for 16nm can't use "masks" from a 40nm process. Voltages are all wrong, leakage current and a whole host of things that don't manifest at 40nm become hugely important at 16nm. I expect the testing and packaging also changes.
|
|
|
|
Entropy-uc
|
|
January 09, 2018, 06:42:39 AM Last edit: March 27, 2018, 01:44:13 AM by frodocooper |
|
I am not a semiconductor guy, but just my discussions with folks that are suggest that the "design rules" and associated tool chains change on a regular basis as the node size shrinks. What that means to me is that the "rules and tools" for a 40nm process don't work for a 28nm process which don't work at 16nm. I think that means that your idea of developing with "cheap" process and then shrinking down won't work since the Fab for 16nm can't use "masks" from a 40nm process. Voltages are all wrong, leakage current and a whole host of things that don't manifest at 40nm become hugely important at 16nm. I expect the testing and packaging also changes.
I know that very well. The arrangement of transistor gates required to implement a double SHA-256 hash would not change; the implementation would change for the target process node and fab. By demonstrating that your transistor-level design is correct, the risk is dramatically reduced. It's not zero because, as you say, the implementation at each node would require a unique layout. Would it really move the needle on the cost of implementing a bitcoin hash chip? I can't say for certain. I can tell you that a surface discussion with somebody exposed to semi design won't give you a valid answer, because they are thinking in terms of tool chains and standard cells that decouple you from the transistor level by several layers. It simply isn't industry-standard practice. But the results delivered by Bitfury make it clear it's the only way to be competitive in the crypto mining space.

Moderator's note: This post was edited by frodocooper to remove a nested quote.
|
|
|
|
alh
Legendary
Offline
Activity: 1849
Merit: 1052
|
|
January 09, 2018, 08:15:00 AM |
|
Are you by any chance a marketing guy?
No. Ph.D. in engineering. I worked in process R&D for Intel for over a decade before I escaped. Given your experience and education, why would you start asking questions here? I am lost as to what you are seeking from a truly random collection of folks here...
|
|
|
|
majlkcze
Newbie
Offline
Activity: 25
Merit: 0
|
|
January 09, 2018, 08:31:39 AM Last edit: March 27, 2018, 01:43:07 AM by frodocooper |
|
Interesting topic, guys. I'm still thinking about the things you are talking about here, and this should be possible. I went through the whole process from FPGA to ASIC as a microelectronics student, from design to fabrication; yes, I was personally in the clean room holding the wafers.
The saddest thing is that it was 5+ years back, when crypto was not so well known, and I had no clue what could be done in this area.
Now I'm still at the same faculty as a Ph.D. student. I think I can pull some strings and be helpful in this area. If a group of members decides to try something, count me in.
Moderator's note: This post was edited by frodocooper to remove an unnecessary quote.
|
|
|
|
2112
Legendary
Offline
Activity: 2128
Merit: 1073
|
|
January 10, 2018, 01:12:08 AM |
|
The problem is you won't find a design house willing to work that way. They have their tool sets and their work flows and they aren't going to diverge from it. So you will need to buy your own set of design tools and find a team of borderline Asperger's cases to do the transistor design.
This is where I disagree; people are just looking for the wrong kind of design house. They need to look for designers experienced and interested in mixed-signal and power-electronics designs. That is significantly different from the predominant industry practice in digital logic design.

Technically, there are 3 main points where a Bitcoin mining chip differs from the typical modern digital IC:

1) SHA-256D is practically a fully self-testing circuit;

2) SHA-256D has a very high signal toggle rate (0.5, only 3dB below the theoretical maximum of 1.0 for a ring oscillator);

3) there are practically no external design requirements (like timing closure): the chip is 100% limited either by thermal/power (when over-clocking) or by self-switching noise (when under-volting).

The standard tool chains used in digital logic design fail to produce efficient designs with the above requirements:

1) heuristic layout optimization algorithms fail to converge on a design where each bit of the output depends on each bit of the input, so the designers force round unrolling to achieve convergence;

2) the methodology is mostly designed for timing closure or test-driven design, which is completely not a problem here;

3) the approximations made by the toolchains are very inaccurate in the interesting problem space (very high toggle rate and no timing demand whatsoever).

The end result is that standard tools produce designs that are way too conservative in terms of the individual reliability of gates and flip-flops: they are way too reliable at the local-logic level and then trade this off for noise tolerance on the very long and high-fan-out interconnections.

Somebody should really fund a professor to do the design work under an open hardware license. Once the transistor-level design for SHA-256 is done, you just have to bring it into the fab's design tools and optimize for placement. Conductor losses are becoming dominant at these process nodes, so that is where the biggest optimizations will be found.
Well, I haven't spoken with a professor recently, but I did in the past. From that experience I can surmise that work on a Bitcoin miner could be a career-limiting move for a scientist in the current prevailing climate at the engineering schools.

If you are going to ask around, here are two good questions to ask:

1) Why does nobody consider plain old serial adders/subtractors for the most common operation in SHA-256D? One can add two 32-bit numbers with a few XOR gates and a D flip-flop with the absolute minimum of power spent. It will just take 32 clocks, but who cares? Why such an obsession with parallel adders and complex carry-look-ahead logic when there's no real timing constraint?

2) Why does nobody consider the old trick of using differential logic (like ECL vs. TTL) when the signal toggle rate is at 50%? Complementary logic gives great power savings provided that the toggle rate is much less than that, the closer to zero the better. It is of no benefit here whatsoever.

So if you guys are looking for either commercial design houses or semiconductor design professors, avoid the mainstream. In addition to the two categories above, I could also suggest asking about past experience with GaAs or other exotic processes, which exercised less-explored corners of the design mind-scape. Remember, BitFury's original design may have been done on a kitchen table or in a garage, but they did not unroll, and they decisively beat all the experienced CAD-monkeys despite using a much older fabrication process.
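2112's first question can be checked in software. Below is a Python sketch (the function name `serial_add32` is my own, and this is only a behavioral model, not a netlist) of the bit-serial adder he describes: one sum bit per clock from a couple of XOR gates, with a single carry bit standing in for the D flip-flop, so a 32-bit add takes 32 clocks.

```python
def serial_add32(x, y):
    """Bit-serial 32-bit addition: one full-adder bit per 'clock'.

    Models the structure 2112 describes: two XOR gates for the sum,
    a little AND/OR logic for the carry, and a single D flip-flop
    holding the carry between clocks.
    """
    carry = 0                                # the D flip-flop
    result = 0
    for clk in range(32):                    # 32 clocks for a 32-bit add
        a = (x >> clk) & 1
        b = (y >> clk) & 1
        s = a ^ b ^ carry                    # sum bit: two XOR gates
        carry = (a & b) | (carry & (a ^ b))  # carry logic
        result |= s << clk
    return result                            # final carry-out discarded

# serial_add32(0xFFFFFFFF, 1) -> 0, i.e. wrap-around mod 2**32,
# which is exactly the addition SHA-256 uses.
```

With no timing constraint to meet, the 32-clock latency is irrelevant; the appeal is that the whole adder is a handful of gates plus one flip-flop.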
|
|
|
|
2112
Legendary
Offline
Activity: 2128
Merit: 1073
|
|
January 10, 2018, 08:11:53 PM |
|
I'd like to learn a little more about this transistor-level implementation; I'm having a hard time picturing what could reasonably be exploded or minimized in the hash core. XOR? It's just flops and wiring otherwise. I would be surprised if the flop were exploded, but maybe. If you have any links to check out, it would be an interesting read.
Here's an example of what can be optimized with transistor-level knowledge. SHA-256 has 64 rounds which, when unrolled, have values that once computed have to be used in 16 different places (a fanout of 16). For this example, let's simplify and assume that there are only 2 inputs and 6 outputs:

a <= b <= c <= d <= e <= f <= x + y;

This can be optimized to:

a <= b <= c <= x + y;
d <= e <= f <= x + y;

The optimization is that the same value is computed twice, but in different physical locations on the die, and the signal needs shorter routes from source to destination. Here's a more in-depth explanation: https://en.wikipedia.org/wiki/FO4

Note that the above optimization is the opposite of the ASICBOOST "optimization". We know that recent Bitmain chips have the capability to work both in the regular way and boosted with ASICBOOST (with a theoretical maximum of about 25% savings). We also know that when used in the boosted configuration they need to be clocked much lower and have lower overall performance (and probably a lower yield of chips that can work in boosted mode). If Bitmain were capable of accurately simulating their chips, they wouldn't waste their resources on that exercise, because 25% is lower than the normal manufacturing tolerances on the process nodes they were using. Transistor-level simulation is nowadays more accurate than the manufacturing variance, and one could actually simulate the performance at the various process corners. From the above we can deduce that they don't have any sort of transistor-level design; they just use standard cells and sandbag the design with wide safety margins. That is the same thing that KnCMiner did years ago.

The other possibility is that Bitmain did implement their chips dual-capable (both boosted and un-boosted) for some non-technical, political, or personal reasons. But that would mean that their chips are even less optimized than they could be, wasting space on the unused boosting logic.
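The benefit of that duplication can be sketched with a toy load model. The linear delay model and all the constants below are my own illustrative assumptions (roughly in the spirit of the FO4 article linked above), not figures from any real process:

```python
def stage_delay(fanout, d0=1.0, k=0.25):
    """Toy model: gate delay grows roughly linearly with the
    capacitive load, i.e. with the number of inputs driven.
    d0 and k are arbitrary illustrative constants."""
    return d0 + k * fanout

# One adder computing x + y and driving all 6 consumers (a..f):
single_adder = stage_delay(6)   # 2.5 delay units

# The same value computed twice, each copy driving 3 consumers:
duplicated = stage_delay(3)     # 1.75 delay units per copy

# The duplicated layout spends one extra adder, but each driver sees
# a lighter load and shorter routes to its nearby consumers.
```

The point of the sketch is only the trade itself: area is spent on a redundant adder to buy back load and wire length, which is the opposite of what area-minimizing standard-cell flows do.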
|
|
|
|
2112
Legendary
Offline
Activity: 2128
Merit: 1073
|
|
January 10, 2018, 08:31:13 PM |
|
Your idea about starting at a larger node is a good one, you would certainly want to debug on a cheap process.
There's nothing to debug at the transistor level that is process-independent. In fact, even the transistor model changes from the BSIM3 to the BSIM4 family when you move from cheap to expensive processes. The general topology of the models is already well known and open-sourced: http://bsim.berkeley.edu/models/

What is secret? The parameter values of those models. And even if you use MOSIS/Europractice or a similar program, you won't be able to publish those secret values. Without them you can't optimize in any sensible way beyond "sandbag the hell out of it and keep your fingers crossed". KnC did that already.
|
|
|
|
QuintLeo
Legendary
Offline
Activity: 1498
Merit: 1030
|
|
January 12, 2018, 09:57:16 PM |
|
Global Foundries operates on a standard contract fab model so it's not really surprising that they built the BFL devices.
Partly true. They do have some extensive contracts with IBM and AMD, dating back to the "fab spinoff" days and amended/updated every so often, that lock up a lot of their capacity whenever IBM or AMD wants it. The contract-fab model applies to whatever is left over.
|
I'm no longer legendary just in my own mind! Like something I said? Donations gratefully accepted. LYLnTKvLefz9izJFUvEGQEZzSkz34b3N6U (Litecoin) 1GYbjMTPdCuV7dci3iCUiaRrcNuaiQrVYY (Bitcoin)
|
|
|
NODEhaven
Jr. Member
Offline
Activity: 58
Merit: 12
|
|
March 09, 2018, 11:46:27 PM |
|
Your idea about starting at a larger node is a good one, you would certainly want to debug on a cheap process.
There's nothing to debug at the transistor level that is process-independent. In fact, even the transistor model changes from the BSIM3 to the BSIM4 family when you move from cheap to expensive processes. The general topology of the models is already well known and open-sourced: http://bsim.berkeley.edu/models/ What is secret? The parameter values of those models. And even if you use MOSIS/Europractice or a similar program, you won't be able to publish those secret values. Without them you can't optimize in any sensible way beyond "sandbag the hell out of it and keep your fingers crossed". KnC did that already.

This is by far one of the better threads I have come across on Bitcointalk. If it's not too much, could you describe a little how KnC "sandbagged" the design, and why didn't they use Europractice?
|
|
|
|
2112
Legendary
Offline
Activity: 2128
Merit: 1073
|
|
March 16, 2018, 04:17:53 PM |
|
This is by far one of the better threads I have come across on Bitcointalk.
If it's not too much, could you describe a little how KnC "sandbagged" the design, and why didn't they use Europractice?
"sandbagging" means that they used quite large factors of safety in their design ( https://en.wikipedia.org/wiki/Factor_of_safety describes is for mechanical/structural designs ). E.g. if the design tool came up with N um wide power rail they actually drawn the power rail as S*N where S > 1 . If their simulation computed that the maximum clock speed will be F MHz, they used D*F (where D < 1) in their published specification. One of their executives enumerated their multiple layers of safety margins in the video they published upon initial release of their miners. Maybe somebody archived it somewhere in the KnC thread? Europractice access is limited to educational/research/non-profit institutions. KnC from the beginning was a funded for-profit corporation. On the other hand Bitfury (person) initially developed his chip with cooperation from some Polish research institute before funding the Bitfury (corporation). I keep mentioning Europractice/Mosis in the thread like this because it is an obvious and effective way of saving money in the initial stages of a design. Lots of folks keep mentioning multi-million dollar initial costs of developing the mining ASICs. But this is quite obviously not true if somebody knows how to use the educational discounts and how to deal with associated limitations on merchantability.
|
|
|
|
NODEhaven
Jr. Member
Offline
Activity: 58
Merit: 12
|
|
March 21, 2018, 02:01:35 AM Last edit: March 27, 2018, 01:34:12 AM by frodocooper |
|
...
How does the overt ASICboost that Halong is implementing affect the logic on the chip?

Moderator's note: This post was edited by frodocooper to trim the quote from 2112.
|
|
|
|
2112
Legendary
Offline
Activity: 2128
Merit: 1073
|
|
March 21, 2018, 05:12:56 PM |
|
How does the overt ASICboost that Halong is implementing affect the logic on the chip?
I don't think that there's any non-bullshit information available publicly about Halong chips, so I'll refrain from making comments.
|
|
|
|
NODEhaven
Jr. Member
Offline
Activity: 58
Merit: 12
|
|
March 24, 2018, 07:10:53 PM Last edit: March 27, 2018, 01:35:22 AM by frodocooper |
|
"sandbagging" means that they used quite large factors of safety in their design ( https://en.wikipedia.org/wiki/Factor_of_safety describes is for mechanical/structural designs ). E.g. if the design tool came up with N um wide power rail they actually drawn the power rail as S*N where S > 1 . If their simulation computed that the maximum clock speed will be F MHz, they used D*F (where D < 1) in their published specification. One of their executives enumerated their multiple layers of safety margins in the video they published upon initial release of their miners. Maybe somebody archived it somewhere in the KnC thread? Europractice access is limited to educational/research/non-profit institutions. KnC from the beginning was a funded for-profit corporation. On the other hand Bitfury (person) initially developed his chip with cooperation from some Polish research institute before funding the Bitfury (corporation). I keep mentioning Europractice/Mosis in the thread like this because it is an obvious and effective way of saving money in the initial stages of a design. Lots of folks keep mentioning multi-million dollar initial costs of developing the mining ASICs. But this is quite obviously not true if somebody knows how to use the educational discounts and how to deal with associated limitations on merchantability. I will check the KncMiner thread and post a link if I can find it. Also, That's pretty genius of Bitfury. I know of a few professors at University of Houston that are interested in developing some FPGAs for crypto-currency. Also, in college I took full advantage of those type of licenses. Right now I looking into on-chip temperature sensors and voltage regulation to use in a feedback loop that may require outside IP if feasible which would require a license. Those licenses may forgo the ability to use the non-profit approach. Something I pulled up after a quick search. It has digital output. Not sure if that is an issue and how to calibrate it. 
https://www.design-reuse.com/sip/temperature-sensor-series-6-with-digital-output-tsmc-7nm-ff-high-accuracy-thermal-sensing-for-reliability-and-optimisation-ip-43229/?login=1 Moderator's note: This post was edited by frodocooper to remove a nested quote.
|
|
|
|
2112
Legendary
Offline
Activity: 2128
Merit: 1073
|
|
March 25, 2018, 12:37:12 AM |
|
Don't make the mistake of putting nontrivial control logic onto the same chip as the mining circuitry. In case of failure, you won't be able to distinguish between a real fault and a bogus fault induced by the noise and/or heat from the mining logic. By definition, the mining logic has to work at the edge of starvation or hyperthermia death; otherwise it is operating far from optimal.

Helveticoin did something like you are thinking of (including an on-die ARM controller) and it was completely non-competitive. It had to be severely underclocked to maintain the reliability of the controlling SoC. Spondoolies included on-die power-on-self-test and then had to create software workarounds for mining engines that fail the POST but operate correctly after a warm-up. Some desperadoes resorted to preheating their miners with a hair dryer.

You'll be much better off with just temperature-sensing diodes, or averaging multiple low-accuracy temperature sensors located in far-away corners of the die.
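The last suggestion amounts to nothing more than a mean over the corner sensors. A minimal sketch (the function name and the sample readings are mine):

```python
def die_temperature(readings):
    """Average several low-accuracy corner sensors, as suggested above.

    If the sensors' errors are independent and random, averaging n of
    them shrinks the error roughly by 1/sqrt(n), so a handful of cheap
    diodes in far-apart corners can beat one 'smart' on-die controller
    that itself misbehaves under mining noise and heat.
    """
    if not readings:
        raise ValueError("need at least one sensor reading")
    return sum(readings) / len(readings)

# Four corner sensors, each a few degrees off in a different direction:
# die_temperature([81.0, 78.5, 80.5, 80.0]) -> 80.0
```

The controller consuming this average then lives off-chip, where its faults are distinguishable from mining-logic faults.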
|
|
|
|
|
HyperMega
|
|
March 26, 2018, 03:39:46 PM Last edit: March 27, 2018, 01:40:04 AM by frodocooper |
|
How does the overt ASICboost that Halong is implementing effect the logic on the chip?
Please have a look at page 8 of the original ASICboost white paper: https://arxiv.org/ftp/arxiv/papers/1604/1604.00575.pdf

There is a duo-core ASICboost implementation shown. If you operated such a duo-core in non-ASICboost mode (at the same clock frequency), you would run at 50% of the ASICboost performance, because only one of the two cores can operate in non-ASICboost mode. Ck said in another thread that the Halong miner is at 25% of its performance in non-ASICboost mode. Because of that, I would assume that they implemented a quad-core, which requires about 18.75% less silicon area (leakage power) and logic toggling (dynamic power) compared to 4 non-ASICboost cores.

Moderator's note: This post was edited by frodocooper to trim the quote from NODEhaven.
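Those percentages follow from a one-line formula, assuming (as the post does) that the reusable part of the pipeline is about 25% of one core's logic; the function name is my own shorthand:

```python
def asicboost_savings(n_cores, shared_fraction=0.25):
    """Theoretical area/toggle savings when n cores share the reusable
    ~25% of the pipeline (the assumption taken from the whitepaper).

    One shared copy replaces n copies, so n cores need
    n*(1 - shared) + shared units of logic instead of n units.
    """
    return shared_fraction * (n_cores - 1) / n_cores

# asicboost_savings(2) -> 0.125   (the duo-core's 12.5%)
# asicboost_savings(4) -> 0.1875  (the quad-core's 18.75%)
```

The savings approach the full 25% only as the core count grows, which is one reading of why Halong would pick quad rather than duo.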
|
|
|
|
2112
Legendary
Offline
Activity: 2128
Merit: 1073
|
|
March 26, 2018, 05:27:02 PM |
|
Please have a look at page 8 of the original ASICboost white paper: https://arxiv.org/ftp/arxiv/papers/1604/1604.00575.pdfCk said in another thread, that the Halong miner is at 25% of its performance in a non-ASICboost mode. Because of that I would assume, that they implemented a Quad-Core, which requires about 18.75% less silicon area (leakage power)/logic toggling (dynamic power) compared to 4 non-ASICboost cores. All the numbers in that paper are theoretical values assuming infinite speed of light and counting of ideal logic gates with no parasitic impedances, infinite input impedance and zero output impedance. That has no bearing on any actual implementation in any realistic logic circuit technology. In particular even non-ASIC-boosted but unrolled SHA256 has same values used in 16 different places. This implies https://en.wikipedia.org/wiki/Fan-out of 16 when nearly all CMOS processes are optimized for fan-out of 4 https://en.wikipedia.org/wiki/FO4 . The FO4 argument probably explains why that chip is built with fixed 4-way ASICboost.
|
|
|
|
HyperMega
|
|
March 26, 2018, 07:15:38 PM Last edit: March 27, 2018, 01:41:57 AM by frodocooper |
|
All the numbers in that paper are theoretical values assuming infinite speed of light and counting of ideal logic gates with no parasitic impedances, infinite input impedance and zero output impedance. That has no bearing on any actual implementation in any realistic logic circuit technology. In particular even non-ASIC-boosted but unrolled SHA256 has same values used in 16 different places. This implies https://en.wikipedia.org/wiki/Fan-out of 16 when nearly all CMOS processes are optimized for fan-out of 4 https://en.wikipedia.org/wiki/FO4 . The FO4 argument probably explains why that chip is built with fixed 4-way ASICboost. These numbers are not based on completely ideal assumptions. They are based on the fact that the part of the pipeline, which outputs could be reused by other cores, counts for about 25% of the overall core logic of a single core. Ok, you are right, the FO/load cap of the reused bits is increased by feeding multiple cores. But the reused outputs are only 32 bits in contrast to a 512 bit wide pipeline without increased FO, implemented only once. So the gain of an ASICboost duo-core in terms of power efficiency will be a bit less than 12.5%, but not much. Moderator's note: This post was edited by frodocooper to remove a nested quote.
|
|
|
|
2112
Legendary
Offline
Activity: 2128
Merit: 1073
|
|
March 26, 2018, 08:13:09 PM |
|
These numbers are not based on completely ideal assumptions. They are based on the fact that the part of the pipeline whose outputs can be reused by other cores accounts for about 25% of the overall core logic of a single core.
OK, you are right, the FO/load cap of the reused bits is increased by feeding multiple cores. But the reused outputs are only 32 bits wide, in contrast to the 512-bit-wide pipeline without increased FO, implemented only once.
I haven't read the full patent application, but I understand how they are written, with the goal of withstanding the claim/counter-claim adversarial legal system in the USA and other anglophone countries. So I can confidently repeat: you are wrong; these numbers intentionally use idealized, abstract algebraic models to make a strong patent application. The whitepaper is just a marketing brief for the patent. It isn't a scientific report in the applied-science field.

In the next paragraph you use the term "512-bit-wide pipeline". That is just such nice marketing speak. SHA-256 is actually a 16-stage, 32-bit-wide shift register with some fancy feedback terms. The re-invention of it as a 16*32 = 512-bit vector pipeline is nothing more than a workaround for the bugs/design flaws in the front-end Verilog tools used preferentially on the West Coast of the USA. If the design were done in VHDL (as preferred by East Coast USA boutiques), there would be no need for that trick of making 32-bit slices out of a 512-bit vector. No matter which front end is used, the actual physical layout is very far from the neatness associated with the word "pipeline" and how, e.g., AMD/Intel use it in their marketing literature and die photos. The physical layout of such an unrolled mining engine very much resembles the snake-pit-like one in my avatar. That happens because the heuristic layout optimization tools cannot find any useful gradient to optimize for, and either fail to converge or converge extremely slowly, resulting in a semi-random rat's nest of long traces.

So the gain of an ASICboost duo-core in terms of power efficiency will be a bit less than 12.5%, but not much.
I cut this paragraph into a separate quote because it is a beautiful sample of USDA-prime marketing baloney.

Firstly, the duo-core was just an example in the whitepaper; Halong's implementation is quad-core. So it is 18.75%, not 12.5%.

Secondly, you quantify the shortfall as "a bit". Such a nice English creative-writing trick. How does your "bit" compare with manufacturing tolerances, which are about +/-20%?

Thirdly, it is not just about (A) reduction of power use. You neglected to mention: (B) lower clock speed, due to the need to keep a nearly four-times-larger area in lockstep; and (C) lower yield, because the area of mutually dependent logic is increased nearly four-fold.

It is quite an achievement in marketing to squeeze a 3-way deception into a single sentence. You must be a professional.

Finally, whatever one can say about Bitmain's ASICboost-capable chip, at least it is somewhat honest in implementing switchable levels of ASICboost. One could actually measure the actual gains or losses from various levels of boosting and compare them with the table of theoretical values. It isn't as perfect an experiment as designing separate chips for each level of boosting, but it is a better scientific compromise.

All I can guess about Halong's chip is that its design was worked out as some sort of political compromise or attack/defense strategy. I'm definitely not up to speed on the factions currently involved in the Bitcoin internecine warfare.
|
|
|
|
HyperMega
|
|
March 27, 2018, 03:43:39 PM |
|
It is quite an achievement in marketing to squeeze 3-way deception into a single sentence. You must be a professional.
Finally, whatever one can say about Bitmain's ASICboost-capable chip, at least it is somewhat honest in implementing switchable levels of ASICboost. One could actually measure the actual gains or losses from various levels of boosting and compare them with the table of theoretical values. It isn't as perfect an experiment as designing separate chips for each level of boosting, but it is a better scientific compromise.
All I can guess about Halong's chip is that its design was worked out as some sort of political compromise or attack/defense strategy. I'm definitely not up to speed on the factions currently involved in the Bitcoin internecine warfare.
A professional marketing guy? No, I'm not that kind of professional. It always takes me a while to extract the useful information from your posts, but believe me, I finally agree with you, sometimes.

Yes, having the ability to switch ASICboost on/off (as Bitmain did) gives you a chance to compare the two modes. It would even be possible to do a complete power shut-off of the unused backup logic in ASICboost mode to avoid the leakage of those logic parts. But it still consumes silicon area, which increases your production costs in terms of $/GH.

Halong has chosen a very aggressive way to implement ASICboost, without any backup logic for a non-ASICboost mode. In this way they have enabled the full potential of ASICboost in terms of J/GH and $/GH. I wouldn't dare something like that without the support of parts of the community (e.g. Slush). The risk of falling to only 25% of the maximum performance would be much too high in case no pool supported rolling versions. So yes, I agree, it was "a sort of political compromise or attack/defense strategy".
|
|
|
|
|