Bitcoin Forum
Warning: One or more bitcointalk.org users have reported that they strongly believe that the creator of this topic is a scammer. (Login to see the detailed trust ratings.) While the bitcointalk.org administration does not verify such claims, you should proceed with extreme caution.
Author Topic: Block Erupter: Dedicated Mining ASIC Project (Open for Discussion)  (Read 58543 times)
firefop
Sr. Member
****
Offline Offline

Activity: 420
Merit: 250


View Profile
November 21, 2012, 08:14:13 PM
 #141

Yes, it's just like unrolling loop iterations, but in hardware. You have timing problems because you have one clock pulse per hash and everything needs to arrive at its stage at the right time; if you have traces that are too long or too short, then stuff doesn't function correctly or you need to waste more silicon trying to properly synchronize data.

It is much easier to just keep data exactly where it needs to be and iteratively process it.

I agree it's easier... but it isn't better. Which was of course, my entire point.

No. As bitfury demonstrated in practice, tiny hashing cores are superior to unrolled designs. He gets 300 Mh/s compared to 220-240 Mh/s for the competition, on an LX150. Some of the gains are attributable to overclocking and overvolting, but most come from its superior design.

No. You said it yourself "on an lx150" - the correct way to do this would be to use dozens or hundreds of chips and have it process in a single stage... impractical on an FPGA, but perfectly doable for an ASIC.


makomk
Hero Member
*****
Offline Offline

Activity: 686
Merit: 564


View Profile
November 21, 2012, 09:31:52 PM
 #142

No. As bitfury demonstrated in practice, tiny hashing cores are superior to unrolled designs. He gets 300 Mh/s compared to 220-240 Mh/s for the competition, on an LX150. Some of the gains are attributable to overclocking and overvolting, but most come from its superior design.
That's because the routing really sucks on Spartan-6 FPGAs. I'm not convinced an ASIC would have the same problem.

Quad XC6SLX150 Board: 860 MHash/s or so.
SIGS ABOUT BUTTERFLY LABS ARE PAID ADS
mrb
Legendary
*
Offline Offline

Activity: 1512
Merit: 1027


View Profile WWW
November 21, 2012, 10:24:50 PM
 #143

firefop, I think you are confusing 2 aspects which are orthogonal to each other: the die size (large or small) is mostly irrelevant to the type of design (tiny hashing cores or large unrolled cores).

For one, even a large unrolled core would fit in a chip smaller than BFL's SC (56.25 mm² at 65 nm). So no matter what design you choose (unrolled or not), you can put as many cores as you want to target whatever die area you want.

You guys claim that routing is not an issue on ASIC, but this is incorrect too. It is less of an issue compared to FPGAs, but it still is one, especially for SHA-256 where you have 256 bits of state to manipulate. If you are familiar with the algorithm, you should know that this state (A..H) is rotated in the main loop, so the 256 bits are used all over the place and create routing challenges. This is less of an issue with a non-unrolled design, as the state can be kept close to the tiny core.
DiabloD3
Legendary
*
Offline Offline

Activity: 1162
Merit: 1000


DiabloMiner author


View Profile WWW
November 21, 2012, 10:28:37 PM
 #144

No. As bitfury demonstrated in practice, tiny hashing cores are superior to unrolled designs. He gets 300 Mh/s compared to 220-240 Mh/s for the competition, on an LX150. Some of the gains are attributable to overclocking and overvolting, but most come from its superior design.
That's because the routing really sucks on Spartan-6 FPGAs. I'm not convinced an ASIC would have the same problem.

It would, but that's because heat and (lack of) voltage (kept low to prevent more heat, which causes stability issues) would become much more apparent on more complex designs. Spartan-6s just make the problem seem an order of magnitude or two worse than it is.

punin
Hero Member
*****
Offline Offline

Activity: 560
Merit: 500


View Profile WWW
November 21, 2012, 10:31:08 PM
 #145

Yes, it's just like unrolling loop iterations, but in hardware. You have timing problems because you have one clock pulse per hash and everything needs to arrive at its stage at the right time; if you have traces that are too long or too short, then stuff doesn't function correctly or you need to waste more silicon trying to properly synchronize data.

It is much easier to just keep data exactly where it needs to be and iteratively process it.

I agree it's easier... but it isn't better. Which was of course, my entire point.

No. As bitfury demonstrated in practice, tiny hashing cores are superior to unrolled designs. He gets 300 Mh/s compared to 220-240 Mh/s for the competition, on an LX150. Some of the gains are attributable to overclocking and overvolting, but most come from its superior design.

I can confirm that Bitfury's bitstream runs at ~305 Mhash/s within the normal voltage range per Xilinx specs.

Head of Product Development
Bitfury Group
www.bitfury.com
2112
Legendary
*
Offline Offline

Activity: 2128
Merit: 1065



View Profile
November 21, 2012, 10:40:50 PM
Last edit: November 22, 2012, 12:10:38 AM by 2112
 #146

That's because the routing really sucks on Spartan-6 FPGAs. I'm not convinced an ASIC would have the same problem.
The argument for the sea-of-hashes design can be derived from the classic analysis made by Mead & Conway and contemporaries.

Consider a circular sea-of-gates big enough to implement many copies of the Bitcoin double-SHA256.

SHA256 is basically a pair of 32-bit wide shift registers with some somewhat convoluted feedback logic. The feedback logic is active (doing the actual computation) whereas D-type flip-flops and connections are passive (just shuffle the signal around). Let X be the average connection length in this design.

Now think about unrolling the above design over a plane. You'll need the values of the feedback terms from the neighbouring cells w-2, w-7, w-15, w-16. Your average connection length rises by a factor of (2+7+15+16)/4, to about 10*X. So the passive losses in the interconnect rise by about an order of magnitude. You could compensate for this by removing some D-type flip-flop stages and slowing down the clock. By definition you can't really remove the active logic gates that compute the feedback terms. As an extreme, you can have a purely combinatorial SHA-256 hasher doing everything in a single cycle of a rather slow clock.
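
The factor of ten is just the mean of the four feedback tap distances in the SHA-256 message schedule. A minimal sketch of that arithmetic (the variable names are illustrative, and treating each tap distance as a wire length in units of X is the simplification made in the post, not a place-and-route result):

```python
# Feedback taps of the SHA-256 message schedule: W[t] depends on
# W[t-2], W[t-7], W[t-15] and W[t-16].
TAPS = (2, 7, 15, 16)

# In a fully unrolled layout each tap reaches back that many stages, so the
# mean reach, in units of the per-stage connection length X, is:
mean_reach = sum(TAPS) / len(TAPS)

print(f"average connection length is about {mean_reach:g} * X")
```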

I'm not aware of any neat analytical solution for the above optimization problems. But the numerical experiments show that racing the combinatorial signals over vast expanses of silicon is a losing game. The speed of light in a MOS transmission line is much less than the speed of light in vacuum. This analysis can be made without actual place-and-route; it is sufficient to have an estimated distribution of the interconnection lengths that form the planar graph for the logic. I don't recall if sphere-of-gates instead of sea-of-gates is a win, but sphere-of-gates has an obvious thermal problem even if we could somehow manufacture it.

In summary: wafer-scale integration was attempted several times in the past without an obvious win. Check out the history before you follow that trail.

Please comment, critique, criticize or ridicule BIP 2112: https://bitcointalk.org/index.php?topic=54382.0
Long-term mining prognosis: https://bitcointalk.org/index.php?topic=91101.0
kano
Legendary
*
Offline Offline

Activity: 4466
Merit: 1800


Linux since 1997 RedHat 4


View Profile
November 22, 2012, 02:58:00 AM
 #147

So I see what's going on in here :)

My post in the shareholder thread, which it seems should be in here:
https://bitcointalk.org/index.php?topic=99497.msg1350329#msg1350329

Pool: https://kano.is - low 0.5% fee PPLNS 3 Days - Most reliable Solo with ONLY 0.5% fee   Bitcointalk thread: Forum
Discord support invite at https://kano.is/ Majority developer of the ckpool code - k for kano
The ONLY active original developer of cgminer. Original master git: https://github.com/kanoi/cgminer
DiabloD3
Legendary
*
Offline Offline

Activity: 1162
Merit: 1000


DiabloMiner author


View Profile WWW
November 23, 2012, 12:59:43 PM
 #148

Hey friedcat, if you give me one of these, I'll make sure DiabloMiner supports it.

firefop
Sr. Member
****
Offline Offline

Activity: 420
Merit: 250


View Profile
November 24, 2012, 06:04:30 AM
 #149

The argument for the sea-of-hashes design can be derived from the classic analysis made by Mead & Conway and contemporaries.

Consider a circular sea-of-gates big enough to implement many copies of the Bitcoin double-SHA256.

SHA256 is basically [struck out: a pair of 32-bit wide shift registers] seventy-two 32-bit registers


Fixed that first part for you... but just wasn't up to trying to edit the rest for conceptual logic failures.

Software optimization isn't the same as hardware optimization. ASIC design should not be thought of as "let's make a chip that can do this calculation at 2000 MHz over and over and over..." That's counter-intuitive. You've locked yourself into thinking in terms of GPU design, which need not apply to other processes. The reason GPUs (and yes, CPUs too) are designed this way is that they are multi-function chips. There are operations they know how to do, and they process things according to instructions. That's fine for generalized applications. In the case of a GPU you've got a hard limit / goal of producing a video frame every so many fractions of a second... sha2 just doesn't need that level of coordination. You aren't having to work with a variety of instructions - it's a single process that doesn't change.

Besides which, we're not actually talking about that much data. With sha2 we only need to work with 512 bits at a time. At the very least we had better be unrolling the chunk processing so that it isn't looping... that's hardware design 101.



kano
Legendary
*
Offline Offline

Activity: 4466
Merit: 1800


Linux since 1997 RedHat 4


View Profile
November 24, 2012, 07:27:31 AM
Last edit: November 24, 2012, 08:06:00 AM by kano
 #150

Oh yeah - and make sure the MCU guy stops it from producing a serial device on Windows
(doesn't matter on linux, but that will stop it on linux also of course)

All code with ASIC should be using USB direct not serial-USB
And having the serial-USB can cause problems on windows (and usually means a manual driver fix)

I've been screwing around with this for the last few weeks on an MMQ-FPGA converting it from serial-USB to USB
only to find all the windows problems were driver related - not my code.
Lucky I've had access in IRC to the guy who does libusb, to help me sort it out :)

My reason for doing this was to prepare for the ASIC devices from each of the companies - and I'm glad I did do it in advance - coz the problems have been rather annoying.
Of course there will be other issues when dealing with ASIC, but of course I can't do anything about all of them until I have the devices.

I think (though not 100% sure) the serial-USB device's existence is decided by the firmware so it can be fixed after the fact anyway?

2112
Legendary
*
Offline Offline

Activity: 2128
Merit: 1065



View Profile
November 24, 2012, 10:54:18 AM
 #151

seventy-two 32-bit registers
Fixed that first part for you... but just wasn't up to trying to edit the rest for conceptual logic failures.
The 72 misconception is really getting boring.

FIPS-180-2 defines SHA-256 in terms of two arrays of 32-bit words:
H[8] and W[64]. 8+64=72. Yet a quick comparison with SHA-1 shows that the same "alternative implementation" can be used for SHA-256.

In case of SHA-1 the "original implementation" is H[5] and W[80]; while "alternative implementation" is H[5] and W[16]. Thus: 85 vs. 21.

In case of SHA-256 we have 72 vs. 24 (H[8] and W[16]).

The further observation is that the "arrays" or "circular queues" in the FIPS-180-2 definition aren't really accessed randomly or in any variable order. Therefore both H and W can be converted to 32-bit wide shift registers, but with unusual feedback functions.
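
The 24-word form can be sketched directly in software: H becomes an 8-word shift register (a..h) and W a 16-word one, each fed by its own feedback function. This is a plain model, assuming nothing about any particular chip; the helper names (`icbrt`, `compress`, `sha256`) are made up for the example, and the constants are derived from the first 64 primes the way the standard describes them rather than typed in.

```python
import hashlib
from math import isqrt

MASK = 0xFFFFFFFF

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def icbrt(n):
    """Integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# First 64 primes; IV and K are the fractional bits of their square/cube roots.
PRIMES = [p for p in range(2, 312) if all(p % d for d in range(2, isqrt(p) + 1))]
IV = [isqrt(p << 64) & MASK for p in PRIMES[:8]]
K = [icbrt(p << 96) & MASK for p in PRIMES]

def compress(state, block):
    """One 64-round compression holding only 8 + 16 = 24 words."""
    w = [int.from_bytes(block[i:i + 4], "big") for i in range(0, 64, 4)]
    a, b, c, d, e, f, g, h = state
    for t in range(64):
        # H shift register: every word moves one place; a and e get feedback.
        s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + s1 + ch + K[t] + w[0]) & MASK
        s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        a, b, c, d, e, f, g, h = (t1 + s0 + maj) & MASK, a, b, c, (d + t1) & MASK, e, f, g
        # W shift register: taps 16, 15, 7 and 2 places behind the new word.
        sig0 = rotr(w[1], 7) ^ rotr(w[1], 18) ^ (w[1] >> 3)
        sig1 = rotr(w[14], 17) ^ rotr(w[14], 19) ^ (w[14] >> 10)
        w = w[1:] + [(w[0] + sig0 + w[9] + sig1) & MASK]
    return [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]

def sha256(msg):
    """Pad the message and run the compression over each 64-byte block."""
    padded = msg + b"\x80" + bytes((55 - len(msg)) % 64) + (8 * len(msg)).to_bytes(8, "big")
    state = IV
    for i in range(0, len(padded), 64):
        state = compress(state, padded[i:i + 64])
    return b"".join(s.to_bytes(4, "big") for s in state)

# Cross-check against the library implementation on an 80-byte input.
print(sha256(bytes(80)) == hashlib.sha256(bytes(80)).digest())
```

The cross-check against `hashlib` is the only evidence offered that the model is right; it says nothing about area or timing in silicon.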

The above is just for pure SHA-256, without any Bitcoin specific optimizations. At least two people claimed to be able to apply some unspecified optimizations to Bitcoin hash expressed as a binary function:

1) killerstorm
https://bitcointalk.org/index.php?topic=55888.0

2) Gareth (BitInstant)
https://bitcointalk.org/index.php?topic=10661.msg557579#msg557579

but nothing came out of it. By now pretty much everyone knows that one can shave the last 3 rounds from the 2nd SHA-256 in Bitcoin: instead of looking for a zero 32-bit word at the most significant position in H, take advantage of the fact that H is a shift register and the last 3 rounds simply shift the would-be-most-significant word. So look for the negation of a specific constant value (0x3c6ef372?).
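
A rough software model of that last trick, for anyone who wants to check it: the working variable h after round 64 is whatever e held after round 61, so the test "final word H[7] is zero" can be made three rounds early. The sketch derives the comparison constant from the standard's initial value for H[7] rather than hard-coding the value guessed above; all function names are invented for the example, and it models only the final single-block hash over a 32-byte digest.

```python
import hashlib
from math import isqrt

MASK = 0xFFFFFFFF

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def icbrt(n):
    """Integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# Constants derived from the first 64 primes, as the standard describes them.
PRIMES = [p for p in range(2, 312) if all(p % d for d in range(2, isqrt(p) + 1))]
IV = [isqrt(p << 64) & MASK for p in PRIMES[:8]]
K = [icbrt(p << 96) & MASK for p in PRIMES]

# H[7] = IV[7] + h, so H[7] == 0 exactly when h == -IV[7] (mod 2**32).
TARGET_E = (-IV[7]) & MASK

def run_rounds(state, block, n_rounds):
    """Return the working variables (a..h) after n_rounds rounds of one block."""
    w = [int.from_bytes(block[i:i + 4], "big") for i in range(0, 64, 4)]
    for t in range(16, 64):
        sig0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        sig1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + sig0 + w[t - 7] + sig1) & MASK)
    a, b, c, d, e, f, g, h = state
    for t in range(n_rounds):
        s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + s1 + ch + K[t] + w[t]) & MASK
        s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        a, b, c, d, e, f, g, h = (t1 + s0 + maj) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return (a, b, c, d, e, f, g, h)

def second_hash_block(digest32):
    """Pad a 32-byte first-pass digest into the single block of the 2nd hash."""
    return digest32 + b"\x80" + bytes(23) + (256).to_bytes(8, "big")

def top_word_is_zero(digest32):
    """Early-out check: 61 rounds instead of 64."""
    return run_rounds(IV, second_hash_block(digest32), 61)[4] == TARGET_E

# e after round 61 is the same word as h after round 64.
blk = second_hash_block(hashlib.sha256(b"example").digest())
print(run_rounds(IV, blk, 61)[4] == run_rounds(IV, blk, 64)[7])
```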
 
Software optimization isn't the same as hardware optimization. ASIC design should not be thought of as "let's make a chip that can do this calculation at 2000 MHz over and over and over..." That's counter-intuitive. You've locked yourself into thinking in terms of GPU design, which need not apply to other processes. The reason GPUs (and yes, CPUs too) are designed this way is that they are multi-function chips. There are operations they know how to do, and they process things according to instructions. That's fine for generalized applications. In the case of a GPU you've got a hard limit / goal of producing a video frame every so many fractions of a second... sha2 just doesn't need that level of coordination. You aren't having to work with a variety of instructions - it's a single process that doesn't change.

Besides which, we're not actually talking about that much data. With sha2 we only need to work with 512 bits at a time. At the very least we had better be unrolling the chunk processing so that it isn't looping... that's hardware design 101.

I'm puzzled by this part. I never mentioned CPU nor GPU. I tried to pattern my argument after the sort of arguments that were being made around 1980 during the http://en.wikipedia.org/wiki/Mead_%26_Conway_revolution . Perhaps you were mixing me with someone else?

kano
Legendary
*
Offline Offline

Activity: 4466
Merit: 1800


Linux since 1997 RedHat 4


View Profile
November 24, 2012, 11:29:39 AM
 #152

1) Firstly, the double sha256 is a total of 3 rounds (with 64 steps each) - just the whole first round is constant across a full nonce range.
(commonly known as the midstate) that you only need to do once per nonce range.
2) Secondly, the first 3 steps of the 2nd round are constant across a full nonce range.
3) Thirdly, some of the W values are also constant across a full nonce range (easy to work out which)
4) Then finally, as you said, you don't need to complete the last 3 steps of the 3rd round.

In ASIC terms it would be risky to implement any of 2, 3 or 4
While you may gain a few % overall (6 out of 128 steps plus W optimisations) it also means you can only sha256 an exact BTC block header.
If BTC continues to use sha256 but makes any changes to the block header, then that wouldn't be a problem if none of steps 2, 3 or 4 were implemented in the silicon, since you could change the firmware to deal with a different header.
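
Point 1 (the midstate) is easy to show in software. A minimal sketch, assuming the usual 80-byte header with the nonce in the last 4 bytes, little-endian: `hashlib` does not expose the internal state, so `copy()` of a hasher that has absorbed the first 64 bytes stands in for the midstate here. The function names and the dummy header are made up for the example.

```python
import hashlib
import struct

def double_sha256(data):
    """Reference: hash the whole 80-byte header from scratch, twice."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def scan(header76, nonces):
    """Yield (nonce, hash), absorbing the first 64 header bytes only once."""
    midstate = hashlib.sha256(header76[:64])  # constant across the nonce range
    tail = header76[64:]                      # 12 bytes: merkle tail, time, bits
    for nonce in nonces:
        h = midstate.copy()                   # resume from the saved state
        h.update(tail + struct.pack("<I", nonce))
        yield nonce, hashlib.sha256(h.digest()).digest()

header76 = bytes(range(76))                   # dummy header without the nonce
for nonce, digest in scan(header76, range(3)):
    print(nonce, digest == double_sha256(header76 + struct.pack("<I", nonce)))

# Points 2 and 4 save 3 + 3 of the remaining 2 * 64 steps:
print(f"{6 / 128:.1%}")
```

That ratio is where the "few %" comes from, before counting the constant W terms.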

2112
Legendary
*
Offline Offline

Activity: 2128
Merit: 1065



View Profile
November 24, 2012, 12:55:17 PM
 #153

only to find all the windows problems were driver related - not my code.
I'm presuming that you had problems with the usbser.sys from Microsoft. Did you also have problems with the Prolific/FTDI drivers?

I think (though not 100% sure) the serial-USB device's existence is decided by the firmware so it can be fixed after the fact anyway?
On the LPC1343 like ModMiner yes. ngzhang used hard serial-USB chips (Prolific or FTDI) in his designs. Same with Enterpoint (FTDI).

2112
Legendary
*
Offline Offline

Activity: 2128
Merit: 1065



View Profile
November 24, 2012, 01:06:17 PM
Last edit: November 24, 2012, 03:43:48 PM by 2112
 #154

1) Firstly, the double sha256 is a total of 3 rounds (with 64 steps each) - just the whole first round is constant across a full nonce range.
(commonly known as the midstate) that you only need to do once per nonce range.
2) Secondly, the first 3 steps of the 2nd round are constant across a full nonce range.
3) Thirdly, some of the W values are also constant across a full nonce range (easy to work out which)
4) Then finally, as you said, you don't need to complete the last 3 steps of the 3rd round.
Thanks. I'm quoting this because it is a very nice reference for the state-of-the-art GPU/FPGA optimizations. I remembered the 4) on your list the most because it most clearly shows the shift-register structure inherent to the SHA-256.

Edit: Note to self: Kano is swapping the standard terminology: step vs. round. Using standard terminology, the first SHA-256 hash in Bitcoin consists of 2 steps of 64 rounds each.

In ASIC terms it would be risky to implement any of 2, 3 or 4
While you may gain a few % overall (6 out of 128 steps plus W optimisations) it also means you can only sha256 an exact BTC block header.
If BTC continues to use sha256 but makes any changes to the block header, then that wouldn't be a problem if none of steps 2, 3 or 4 were implemented in the silicon, since you could change the firmware to deal with a different header.
At least for the chip discussed in this thread it appears that the block header structure is fixed:
0-31    writing midstate
32-43   writing data
44-47   reading nonce
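
For illustration only, here is how a host-side driver might treat that layout as a 48-byte register window. Only the three byte ranges come from the list above; the function names, the use of a flat `bytearray`, and the little-endian nonce are assumptions, not anything published for the chip.

```python
import struct

# Byte ranges from the register list; everything else here is assumed.
MIDSTATE = slice(0, 32)   # written by the host
DATA = slice(32, 44)      # written by the host: remaining 12 header bytes
NONCE = slice(44, 48)     # read back from the chip

def load_work(regs, midstate, data):
    """Write one unit of work into the (hypothetical) register window."""
    if len(midstate) != 32 or len(data) != 12:
        raise ValueError("expected a 32-byte midstate and 12 bytes of data")
    regs[MIDSTATE] = midstate
    regs[DATA] = data

def read_nonce(regs):
    """Read the 4-byte nonce; little-endian byte order is an assumption."""
    (nonce,) = struct.unpack("<I", regs[NONCE])
    return nonce

regs = bytearray(48)
load_work(regs, bytes(32), bytes(12))
print(read_nonce(regs))
```

Since the window has no room for anything but a midstate and 12 trailing bytes, a header that put the nonce elsewhere could not be fed to it.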

kano
Legendary
*
Offline Offline

Activity: 4466
Merit: 1800


Linux since 1997 RedHat 4


View Profile
November 24, 2012, 04:00:10 PM
 #155

...
Edit: Note to self: Kano is swapping the standard terminology: step vs. round. Using standard terminology, the first SHA-256 hash in Bitcoin consists of 2 steps of 64 rounds each.
...
Yep I mixed them around - oh well - fortunately it was obvious :D
Thanks for correcting me.

kano
Legendary
*
Offline Offline

Activity: 4466
Merit: 1800


Linux since 1997 RedHat 4


View Profile
November 24, 2012, 04:57:23 PM
 #156

only to find all the windows problems were driver related - not my code.
I'm presuming that you had problems with the usbser.sys from Microsoft. Did you also have problems with the Prolific/FTDI drivers?

I think (though not 100% sure) the serial-USB device's existence is decided by the firmware so it can be fixed after the fact anyway?
On the LPC1343 like ModMiner yes. ngzhang used hard serial-USB chips (Prolific or FTDI) in his designs. Same with Enterpoint (FTDI).

I've only done the MMQ so far. Though the code is 'done' and in my git, I haven't sent the pull to cgminer yet coz I've had to rebase it and there seems to be a bug in the ~2k lines of code I've changed that I still have to track down :)

Most likely I'll try Icarus/Prolific next and find more obstacles :P
(unless I get side tracked on something else ... an ASIC device shows up? :D)

The Windows driver work around, in the MMQ case, was to use http://sourceforge.net/projects/libwdi/files/zadig/ to force it to use WinUSB (on WinXP)
So it's not insurmountable - but best if not every windows end user has to do that.

I've bitched about Serial-USB for a long time but only recently got around to doing this USB direct implementation

Firstly, I've only been messing with USB for a few weeks, so if anything below is way off - let me know.

Guessing at the early figures and considering around 50 GH/s from a single device using diff-1 shares, and that USB has a standard transaction time of 0.125 ms for 480 Mbit/s USB 2.0, there already isn't a lot of space (and the transaction time is higher, 1 ms, at 12 Mbit/s)

50 GH/s is 11.6 diff-1 shares a second on average, so just dealing with 6 transactions for each of them (send work, verify, request, receive, request, finished)
you're using up almost 1% of the USB for a single device (0.87%)
There's of course more overhead (device status, e.g. temperature or anything else available to be monitored), but 6 is pretty much the minimum.
Add 10 of these devices ... and I've no idea how well USB works running at ~9% capacity (and how that affects other USB devices)
Also, if the device is idle for even 1 ms waiting for work, that's more than 1% of its work time lost
Thus why I'm certainly using USB direct for all ASIC USB devices - not Serial-USB and adding more overhead on top of it (and timing issues)
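
The numbers above can be reproduced with a few lines. This is only the back-of-envelope model from the post (one 0.125 ms transaction slot per USB transfer, six transfers per share), with made-up variable names; it is not a measurement of any real device.

```python
HASHRATE = 50e9                  # 50 GH/s, the guessed per-device figure
HASHES_PER_SHARE = 2 ** 32       # average hashes per difficulty-1 share
TXN_PER_SHARE = 6                # send work, verify, request, receive, request, finished
TXN_TIME_S = 0.125e-3            # one high-speed USB 2.0 microframe

shares_per_s = HASHRATE / HASHES_PER_SHARE
bus_fraction = shares_per_s * TXN_PER_SHARE * TXN_TIME_S

# Hashing time lost if the device sits idle for 1 ms per share while waiting.
idle_fraction = 1e-3 * shares_per_s

print(f"shares per second:     {shares_per_s:.1f}")
print(f"bus use, one device:   {bus_fraction:.2%}")
print(f"bus use, ten devices:  {10 * bus_fraction:.1%}")
print(f"lost to 1 ms idle:     {idle_fraction:.2%}")
```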

Down the track, once the first version ASIC devices have been optimised more for hashing performance (e.g. adding passing share difficulty to the firmware if not already ... or even going as far as implementing something like Stratum in the firmware) this will reduce the bandwidth usage of a single device, but then again it shouldn't be that far down the track when 50GH/s per USB device might increase substantially.

2112
Legendary
*
Offline Offline

Activity: 2128
Merit: 1065



View Profile
November 24, 2012, 06:49:51 PM
 #157

Thus why I'm certainly using USB direct for all ASIC USB devices - not Serial-USB and adding more overhead on top of it (and timing issues)
Thank you for the writeup. I'm not really familiar with building clusters using USB, I always worked with real serial HDLC/RS-232/RS-422 controllers or with Ethernet multicast.

The only real USB experience I had was with FTDI USB controllers. Neither ngzhang nor Enterpoint bothered to route all available signals from the serial chip to the FPGAs, so the high-bandwidth low-latency modes of transmission couldn't be used with them.

Hopefully the ASIC controller designers won't make the same mistakes and will allow you to use isochronous or bulk modes when the bus utilization becomes non-negligible.

AfricanHunter
Full Member
***
Offline Offline

Activity: 157
Merit: 103


View Profile
November 26, 2012, 09:10:49 AM
 #158

Anyone know which way friedcat decided to go with this? i.e. sell hardware, shares, selfmine?

Thinking about doing business with johnniewalkerhttps://bitcointalk.org/index.php?action=profile;u=72227?
First read this thread https://bitcointalk.org/index.php?topic=131841.0

Also, Join the National Rifle Association to protect 2nd Amendment Rights http://membership.nrahq.org/default.asp?campaignid=XR020022
DutchBrat
Hero Member
*****
Offline Offline

Activity: 868
Merit: 1000


View Profile
November 26, 2012, 10:12:03 AM
 #159

Anyone know which way friedcat decided to go with this? i.e. sell hardware, shares, selfmine?

See this topic: ASICMINER https://bitcointalk.org/index.php?topic=99497.0

The first 16 TH (roughly) will be used for mining for the company, then the mining farm will be extended while the ASICs are sold to the general public
bcpokey
Hero Member
*****
Offline Offline

Activity: 602
Merit: 500



View Profile
November 26, 2012, 12:27:17 PM
 #160

I came in late, is the project still going forward despite all the GLBSE hoo ha?