Bitcoin Forum
April 26, 2024, 07:31:19 PM
Topic: Algorithmically placed FPGA miner: 255MH/s/chip, supports all known boards (Read 119415 times)
eldentyrell (OP)
Donator
Legendary
Activity: 980
Merit: 1004

felonious vagrancy, personified
October 26, 2011, 10:52:05 PM
Last edit: July 26, 2012, 08:36:25 AM by eldentyrell
 #1

I've been working on an unconventional design and doing 100%
algorithmic placement. This design contains three rings, each of
which is 64 stages of SHA-256 (one stage per round).  Therefore, on each clock cycle each
ring completes 0.5 hashes on average (a nonce has to go through the ring twice in
order to be fully hashed).
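To see where the 0.5 comes from, here is a toy cycle-level model in Python. It is a simplification, not the actual design: each of the 64 stages is treated as a single register slot, a nonce advances one stage per clock, and a slot freed by a finished nonce is refilled with a new nonce immediately. Midstate precomputation and every other SHA-256 detail are ignored.

```python
def ring_throughput(stages: int = 64, passes: int = 2, cycles: int = 100_000) -> float:
    """Average hashes completed per clock by one ring (toy model).

    ring[i] holds the number of laps already finished by the nonce sitting in
    stage i, or None if that stage is empty.
    """
    ring = [None] * stages
    completed = 0
    for _ in range(cycles):
        leaving = ring[-1]            # nonce coming out of the last stage
        ring = [None] + ring[:-1]     # every stage hands its nonce to the next
        if leaving is not None and leaving + 1 < passes:
            ring[0] = leaving + 1     # not finished yet: send it around again
        else:
            if leaving is not None:
                completed += 1        # second lap done: one full hash
            ring[0] = 0               # stage 0 is free, so inject a new nonce
    return completed / cycles


per_ring = ring_throughput()
print(f"{per_ring:.3f} hashes/clock per ring")        # approaches 0.5
print(f"{3 * per_ring:.3f} hashes/clock for three rings")
```

A pipeline that each nonce traverses only once completes about one hash per clock (try `ring_throughput(passes=1)`), so three rings give the same 1.5 hashes per clock as one full pipeline plus one half-pipeline.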

Update, 20-Jul-2012: TML-1.0 has been released.  We are getting an average of 255MH/s/chip across a large farm of speed grade 2 chips (not the more-expensive grade 3 chips).  These are nexus6 boards, whose power supply is much more powerful than the one on the ztex boards.

All known boards (ztex, modminer, icarus, x6500, enterpoint, the-nameless-board-of-rph, nexus6) are supported.  You might need a JTAG cable.  If you would prefer to use your board's proprietary USB connector, contact your board vendor for assistance.

Additional announcement: during the last week, tricone mining has secured funding for its next (i.e. post-TML) bitcoin project (no, not the joke press release).  Unfortunately this means that between now and the end of August, further development of the TML will be at a reduced priority.  We will continue to maintain the signcryption servers, fix bugs, and produce new bitstreams (we have a farm of servers that sweeps through combinations of the various Xilinx tool command line parameters), so you can expect some minor improvements in performance and a steadily-growing supply of bitstreams to choose from.  However, new features will have to wait until 07-Sep-2012.  In order to compensate our users for the unfortunate timing of this priority shift, the commission rate is set to 0% until at least 07-Sep-2012.

Please do not send me forum PMs; I find the PM system very cumbersome to use.  My email address is in my profile.

______________________________________________________________________________

Update, 12-May-2012: Preliminary power numbers posted.

Update, 9-May-2012: New numbers posted.  This is the first design resulting from a substantial re-architecting begun about a month ago (and, I expect, the last one).  All the designs up to 145mhz were incremental improvements; the 170mhz design was a major change.

Update, 19-Mar-2012: sorry for the long delay.  Had two major
problems: one was a hardware/simulator mismatch that could
only be debugged with git bisect (thank you Linus!), but each
build takes ~24 hours.  The other problem was voltage sag on my crappy
PCBs (I'm not a PCB designer), which was why the actual speeds were
below what the Xilinx tools said they should be.  I can now run at the
rated speed.  Results of overclocking tests will be posted shortly.  I
also have another build about to finish that should produce another
small bump in the design clock rate this weekend.
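`git bisect` is a binary search over the commit history, so the number of builds grows only logarithmically with the number of suspect commits; with a ~24-hour build, though, each step still costs a day. A quick sketch of the worst case (the commit counts below are hypothetical):

```python
HOURS_PER_BUILD = 24  # "each build takes ~24 hours"


def bisect_steps(suspect_commits: int) -> int:
    """Worst-case number of builds a binary search needs to find the first bad
    commit among `suspect_commits` candidates, i.e. ceil(log2(n))."""
    if suspect_commits < 1:
        raise ValueError("need at least one suspect commit")
    return (suspect_commits - 1).bit_length()


for n in (16, 100, 1000):
    steps = bisect_steps(n)
    print(f"{n:5d} commits -> {steps:2d} builds, ~{steps * HOURS_PER_BUILD} hours")
```

Even 1000 suspect commits need only 10 builds, but at ~24 hours per build that is still about ten days of waiting.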

Update, 9-Mar-2012: new build finished last night, 140mhz
design rate, 135mhz actual clock rate, 202.5MH/s actual hashrate.
You'll notice that my actual clock rates are slower than my
design clock rates, meaning that I have to very slightly underclock
the design to get optimal error rates.  I don't know why this is,
although I only have a single SG-3 chip and it's on one of the first
PCBs I made -- before I got my reflow oven tuned up.
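The 202.5MH/s figure follows from the three-ring structure described at the top of the post: three rings at 0.5 hashes per clock each give 1.5 hashes per clock, so 135mhz yields 202.5MH/s. The same arithmetic gives 255MH/s at 170mhz, the clock of the 170mhz design from the 9-May update (that this is exactly the average clock behind the 255MH/s/chip farm figure is an assumption). The sketch also computes how far the 135mhz actual clock sits below the 140mhz design rate.

```python
def hashrate_mhs(clock_mhz: float, rings: int = 3, hashes_per_ring_per_clock: float = 0.5) -> float:
    """MH/s for a chip whose rings each finish 0.5 hash per clock."""
    return clock_mhz * rings * hashes_per_ring_per_clock


def underclock_pct(design_mhz: float, actual_mhz: float) -> float:
    """How far below the design clock rate the chip actually runs, in percent."""
    return 100.0 * (design_mhz - actual_mhz) / design_mhz


print(hashrate_mhs(135))                    # 202.5
print(hashrate_mhs(170))                    # 255.0
print(round(underclock_pct(140, 135), 1))   # 3.6
```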

Update, 8-Mar-2012: The results are finally out of the
totally-embarrassing region, so I think I'll start posting them.  I
want to emphasize that these are very preliminary.  I've only
been doing performance optimization on the full design for three weeks
now.  As you can see, I'm getting these hashrates at 128mhz and others
have demonstrated that the SHA-256 critical path on a Spartan-6 can
run well above 200mhz.

All power consumption numbers are measured at the 12V rail of an
ATX power supply (SATA connector), so they include any
inefficiencies introduced by the 12V->1.2V step-down.  Power
consumption measured at the wall will be higher; how much
depends on the power supply's efficiency.
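Converting a 12V-rail reading into an estimate of wall power means dividing by the supply's efficiency at that load. A minimal sketch; the 10 W reading and 80% efficiency are hypothetical example values, not measurements from this thread:

```python
def wall_power_w(rail_power_w: float, psu_efficiency: float) -> float:
    """Estimate AC wall power from power measured on the PSU's 12V output.

    psu_efficiency is the fraction of input power the supply delivers
    (e.g. 0.80 for an 80%-efficient supply at this load).
    """
    if not 0.0 < psu_efficiency <= 1.0:
        raise ValueError("psu_efficiency must be in (0, 1]")
    return rail_power_w / psu_efficiency


# Hypothetical example: 10 W measured on the 12V rail, 80%-efficient PSU.
print(round(wall_power_w(10.0, 0.80), 2))  # 12.5
```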

I use adaptive clocking (DCM_CLKGEN) at 1mhz granularity (yes, I
figured out how to fix the jitter issues) and a separate clock
adjustment for each ring.  This means that a performance defect in
one part of the chip will force the slowdown of only one of the three
rings.  This is another reason why I designed for three separate
half-pipelines instead of one full pipeline and one half-pipeline.

The "actual" power and frequency numbers are produced by running the
design on an SG-3 chip at 1200mV, adapting the three clocks to
maximize (non-erroneous) hashrate, then averaging the three
frequencies.  Most of my cluster is SG-2, but I quote the SG-3 numbers
since that's what everybody else quotes numbers for.  I have digital
control of the voltage supply on my boards, so I'll try adaptive
overvolting at some point; the chips are rated up to 1320mV.
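The benefit of a separate clock per ring can be illustrated with a small model. This is a sketch, not the author's controller: it assumes each ring's error rate can be measured at every 1mhz step (the `toy_error_rate` curve and the per-ring "knee" frequencies are invented for illustration), picks the clock that maximizes error-free hashes for each ring, and compares that with the best single clock shared by all three rings. The reported frequency is the average of the three ring clocks, as described above.

```python
from typing import Callable, List

ErrorModel = Callable[[int], float]  # clock in MHz -> fraction of bad results


def toy_error_rate(knee_mhz: int) -> ErrorModel:
    """Hypothetical model: error-free up to the knee, then 5% worse per MHz."""
    return lambda f: 0.0 if f <= knee_mhz else min(1.0, 0.05 * (f - knee_mhz))


def good_mhs(f: int, err: ErrorModel) -> float:
    """Error-free MH/s from one ring (0.5 hash per clock) at f MHz."""
    return 0.5 * f * (1.0 - err(f))


def best_ring_clock(err: ErrorModel, f_min: int, f_max: int) -> int:
    """Sweep in 1 MHz steps and keep the clock with the most good hashes."""
    return max(range(f_min, f_max + 1), key=lambda f: good_mhs(f, err))


def best_shared_clock(errs: List[ErrorModel], f_min: int, f_max: int) -> int:
    """Same sweep, but one clock must serve every ring."""
    return max(range(f_min, f_max + 1), key=lambda f: sum(good_mhs(f, e) for e in errs))


# Hypothetical knees; the third ring sits on a weaker part of the die.
rings = [toy_error_rate(k) for k in (172, 168, 160)]
per_ring = [best_ring_clock(e, 100, 200) for e in rings]
shared = best_shared_clock(rings, 100, 200)

print(per_ring)                                              # [172, 168, 160]
print(sum(per_ring) / len(per_ring))                         # average of the three ring clocks
print(sum(good_mhs(f, e) for f, e in zip(per_ring, rings)))  # 250.0
print(shared, sum(good_mhs(shared, e) for e in rings))       # 160 240.0
```

In this toy model the weak ring costs only its own throughput (250 vs 240 MH/s with one shared clock), which is the effect described above.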

See the end of the thread for map output, or if you are
interested in licensing.


The printing press heralded the end of the Dark Ages and made the Enlightenment possible, but it took another three centuries before any country managed to put freedom of the press beyond the reach of legislators.  So it may take a while before cryptocurrencies are free of the AML-NSA-KYC surveillance plague.
ElectricMucus
Legendary
Activity: 1666
Merit: 1057

Marketing manager - GO MP
October 26, 2011, 10:54:04 PM
 #2

wow, this is incredible  Cool

Are you going to share it? Under which conditions?
Very, very interesting Wink
tinman951
Full Member
Activity: 149
Merit: 100
October 27, 2011, 12:25:54 AM
 #3

Just wondering, what OS, programs, and other stuff are you running?  Let's say I have the hardware (which I obviously don't).  How would you set this up from a flash drive?

Free micro bitcoins: http://www.bitvisitor.com/?ref=1DFw1VncjVhqdg6GWoQ6Qtc5ncR5RHXxfP
Donate BTC: 18b94MMTWd7bWaUWcq7VHmizgswP2dK6fM
pusle
Member
Activity: 89
Merit: 10
October 27, 2011, 07:23:40 AM
 #4


@big-chip-small-board:  So you get 100Mhash/s from this design on what size FPGA?
eldentyrell (OP)
Donator
Legendary
Activity: 980
Merit: 1004

felonious vagrancy, personified
October 27, 2011, 08:02:34 PM
 #5

@big-chip-small-board:  So you get 100Mhash/s

As you can see, there is room for at least one more copy of the pipeline on the chip.

from this design at what size fpga?

You're looking at an LX150.


Dexter770221
Legendary
Activity: 1029
Merit: 1000
October 27, 2011, 08:48:16 PM
 #6

And 3 months ago someone said that there's no way to achieve 200MH/s Wink

Under development Modular UPGRADEABLE Miner (MUM). Looking for investors.
Changing one PCB with screwdriver and you have brand new miner in hand... Plug&Play, scalable from one module to thousands.
eldentyrell (OP)
Donator
Legendary
Activity: 980
Merit: 1004

felonious vagrancy, personified
October 27, 2011, 09:06:39 PM
 #7

And 3 months ago someone said that there's no way to achieve 200MH/s Wink

In fairness, I still do not have the "corner turn" routes running at that speed.  But I have verified that the circuit you see works (i.e. mines actual shares)... though at 45mhz since the corner turn routes are a ridiculous 22ns.

So, no celebrations just yet...
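The 45mhz figure is the reciprocal of the slowest register-to-register delay: the whole clock period has to cover that 22ns route. A simplified sketch (it ignores setup time, clock-to-output delay and clock skew, all of which eat into the period on real hardware):

```python
def fmax_mhz(path_delay_ns: float) -> float:
    """Highest clock (MHz) at which a path of this delay fits in one period."""
    return 1000.0 / path_delay_ns


def period_budget_ns(clock_mhz: float) -> float:
    """Time available per clock cycle for the slowest path."""
    return 1000.0 / clock_mhz


print(round(fmax_mhz(22.0), 1))   # 45.5 -> the ~45mhz seen with 22ns corner turns
print(period_budget_ns(80.0))     # 12.5 ns budget for corner turns running at 80mhz
print(period_budget_ns(200.0))    # 5.0 ns budget at 200mhz
```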

ElectricMucus
Legendary
Activity: 1666
Merit: 1057

Marketing manager - GO MP
October 27, 2011, 09:07:40 PM
 #8

Excuse the nagging, but are you planning to share your work?  Grin
eldentyrell (OP)
Donator
Legendary
Activity: 980
Merit: 1004

felonious vagrancy, personified
October 27, 2011, 09:10:26 PM
 #9

Excuse the nagging, but are you planning to share your work?  Grin

Undecided, for now.  But honestly, at 45MH/sec it doesn't matter just yet.

eldentyrell (OP)
Donator
Legendary
Activity: 980
Merit: 1004

felonious vagrancy, personified
October 27, 2011, 09:12:10 PM
 #10

But I have verified that the circuit you see works (i.e. mines actual shares)... though at 45mhz since the corner turn routes are a ridiculous 22ns.

Oh, and, to Xilinx: whatever you did to columns 66 and 67 drives me berserk.  I can't use the entire pair of columns because slices are randomly MISSING for no apparent reason.

rph
Full Member
Activity: 176
Merit: 100
October 28, 2011, 06:14:09 AM
Last edit: October 28, 2011, 09:57:00 AM by rph
 #11

Heh, it was a matter of time until somebody hand-placed it. Hardcore.

-rph

Ultra-Low-Cost DIY FPGA Miner: https://bitcointalk.org/index.php?topic=44891
eldentyrell (OP)
Donator
Legendary
Activity: 980
Merit: 1004

felonious vagrancy, personified
October 28, 2011, 09:14:38 PM
 #12

Heh, it was a matter of time until somebody hand-placed it. Hardcore.

Emphasis on "time" Smiley

FWIW, I don't think it's feasible to do this directly in Verilog/VHDL.  I had to write a Java library to generate the (totally illegible) Verilog.
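The Java library itself isn't shown in the thread. As a rough illustration of the general idea only, the Python sketch below emits Verilog for Xilinx `LUT6` primitives, each tagged with an `RLOC` relative-placement attribute computed by a program rather than written by hand. The instance and net names, the grid layout and the XOR6 truth table are hypothetical placeholders, and the `(* RLOC = "..." *)` attribute spelling should be checked against the Xilinx constraints guide for the ISE version in use.

```python
from typing import List

XOR6_INIT = "6996966996696996"  # LUT6 truth table for a 6-input XOR (placeholder logic)


def placed_lut6(name: str, init_hex: str, x: int, y: int, inputs: List[str], output: str) -> str:
    """Verilog for one LUT6 instance pinned to relative slice position (x, y)."""
    if len(inputs) != 6:
        raise ValueError("LUT6 takes exactly six inputs")
    pins = ", ".join(f".I{i}({net})" for i, net in enumerate(inputs))
    return (
        f'(* RLOC = "X{x}Y{y}" *)\n'
        f"LUT6 #(.INIT(64'h{init_hex})) {name} (.O({output}), {pins});"
    )


def placed_column(prefix: str, x: int, height: int) -> str:
    """A vertical stack of LUTs, one per slice row, as a generator might emit it."""
    return "\n".join(
        placed_lut6(
            f"{prefix}_y{y}",
            XOR6_INIT,
            x,
            y,
            [f"{prefix}_in{y}_{i}" for i in range(6)],
            f"{prefix}_out{y}",
        )
        for y in range(height)
    )


print(placed_column("stage0", x=0, height=2))
```

Packing several LUTs into one slice would need additional constraints (such as `BEL`), which this sketch leaves out.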

gmaxwell
Moderator
Legendary
Activity: 4158
Merit: 8382
October 28, 2011, 10:29:49 PM
 #13

Emphasis on "time" Smiley
FWIW, I don't think it's feasible to do this directly in Verilog/VHDL.  I had to write a Java library to generate the (totally illegible) Verilog.

I am reloading this thread like a crazed weasel.  I'm quite eager to see what kind of results you get from the manual placement.
BTCurious
Hero Member
Activity: 714
Merit: 504

^SEM img of Si wafer edge, scanned 2012-3-12.
October 28, 2011, 10:37:05 PM
 #14

Oh wow. I know next to nothing about FPGAs*, but this looks very awesome nonetheless.

*I know that they're field-programmable gate arrays, and they're basically logic components which can be reconnected in different ways, making different chips depending on what the designer wants.

rph
Full Member
Activity: 176
Merit: 100
October 30, 2011, 05:02:21 AM
Last edit: October 30, 2011, 05:13:45 AM by rph
 #15

ztex reached 200MHz on -3 with ISE 13.2 - it builds in about 40 minutes total (map + par) with -xt 5 -t 19.
I bet he used a lot of CPU time to find those settings..  Grin

TBH I don't think hand placement will improve Fmax very much, although it could reduce the build time.

-rph

eldentyrell (OP)
Donator
Legendary
Activity: 980
Merit: 1004

felonious vagrancy, personified
October 30, 2011, 09:14:39 PM
Last edit: October 30, 2011, 09:26:41 PM by big-chip-small-board
 #16

Emphasis on "time" Smiley
FWIW, I don't think it's feasible to do this directly in Verilog/VHDL.  I had to write a Java library to generate the (totally illegible) Verilog.

I am reloading this thread like a crazed weasel.  I'm quite eager to see what kind of results you get from the manual placement.

Corner turns now running at 80mhz.

I know this sounds completely nuts, but there is a very, very, very remote chance of being able to cram three full pipelines onto the chip.  If that doesn't work I can put a "half-pipeline" in the empty space, although this takes a bit more than 50% of the area of a full pipeline (since I can't hardwire the K-values into the LUT equations anymore).  So I think the clock:hashes ratio will be at least 2clocks:2.5hashes.

All of my results are on the less-expensive -2 chip (not the more-expensive -3).

The next week will probably be pretty quiet, I have a major non-bitcoin deadline I have to deal with.  Progress should resume after 9-Nov.

eldentyrell (OP)
Donator
Legendary
Activity: 980
Merit: 1004

felonious vagrancy, personified
October 30, 2011, 09:17:22 PM
 #17

TBH I don't think hand placement will improve Fmax very much, although it could reduce the build time.

If you leave half the chip empty, then yes -- automatic placement will get pretty much the same frequency as hand placement.

If the chip is nearly full, automatic placement won't come even close to hand-placement -- if it can finish at all.

It's been three weeks since PAR has been able to finish my design with the RLOCs turned off at any frequency -- I even tried at 1mhz!  If the placement is crap there simply aren't enough wires to get from point A to point B.

It also helps to do your layout keeping in mind the wiring structure of the Spartan switchboxes: fast-path between slices in the same CLB, the 1x 2x 4x routing lines, and anything longer than that is too slow to be worth thinking about (unless it's a pure register-to-register path with no combinational logic).
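A crude way to see why placement matters for routing: if a connection can only travel between CLBs on the 1x, 2x and 4x lines listed above, the number of hops (and the number of wires it ties up) grows with distance. The sketch below is a deliberately simplified model, assuming independent horizontal and vertical hops and greedy longest-segment-first routing; real Spartan-6 switchbox connectivity is more constrained, and per-hop delays differ by segment type.

```python
from typing import List, Sequence


def hop_plan(distance: int, segments: Sequence[int] = (4, 2, 1)) -> List[int]:
    """Greedily cover a 1-D distance (in CLBs) with the longest segments first."""
    plan: List[int] = []
    remaining = abs(distance)
    for seg in segments:
        count, remaining = divmod(remaining, seg)
        plan.extend([seg] * count)
    return plan


def manhattan_hops(dx: int, dy: int) -> int:
    """Total hops for a connection spanning dx columns and dy rows."""
    return len(hop_plan(dx)) + len(hop_plan(dy))


for dx, dy in [(0, 0), (1, 0), (3, 2), (7, 3), (20, 12)]:
    print((dx, dy), manhattan_hops(dx, dy))
```

Here (0, 0) stands for the same-CLB fast path; a connection spanning (20, 12) already needs 8 hops even in this idealized model.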

sadpandatech
Hero Member
Activity: 504
Merit: 500
October 31, 2011, 02:04:39 AM
 #18

MMM, talk nerdy to me! Subbed just to follow. I am curious, though, to know what you did to improve the corner turns, and whether you already have solutions in mind to improve them further.

If you're not excited by the idea of being an early adopter 'now', then you should come back in three or four years and either tell us "Told you it'd never work!" or join what should, by then, be a much more stable and easier-to-use system.
- GA

It is being worked on by smart people.  -DamienBlack
pusle
Member
Activity: 89
Merit: 10
October 31, 2011, 08:41:25 PM
 #19


Kudos to you for doing this manual routing! It's one hell of a job  Grin

A couple of ideas:

Perhaps you could use the dedicated buses between the DSP blocks to help out with the speed of the "corner turns".
Or put RAM blocks in between, with a few cycles of pipeline delay, to separate the "trees".

It might also be worth investigating smaller devices to get a better "fit", since the cost per LUT is fairly constant.
They are also easier to cool and might run faster and/or have more headroom for overclocking.
BTCurious
Hero Member
Activity: 714
Merit: 504

^SEM img of Si wafer edge, scanned 2012-3-12.
November 01, 2011, 07:47:04 AM
 #20

I'm trying to derive the theory from reading the FPGA threads and some rudimentary knowledge. I'm curious whether I'm close; can anyone let me know if I'm wrong?

FPGAs (Field programmable gate arrays) are collections of logic gates on a chip. The gate types themselves can be changed, and the wiring between them can be changed, so you can make your own high speed chip layout.

Every timing tick, all gates "do their process", and update their outputs, based on what they just had as inputs.

Corner turns: Is this like, all processing is flowing to the right, and then at the end of the chip you need to do some wiring or tricks to continue processing to the left?
