Bitcoin Forum
December 03, 2016, 04:46:48 AM *
Author Topic: Cyclone V now shipping!  (Read 12931 times)
pieppiep
Sr. Member
****
Offline Offline

Activity: 402



View Profile
April 11, 2012, 07:34:50 AM
 #41

Quote
As far as I understand, loops that are not 100% unrolled incur inefficiencies, as partial results have to be fed back in and there has to be logic (multiplexers) to do that, whereas data in a fully unrolled design just percolates from start to finish.
Interesting.
So if you can fit one fully unrolled instance in a device so you can get 1 hash/clock and fit another half unrolled instance that gets 1 hash/2 clocks, the second one holds back the speed of the first one.
Would it be possible to clock the first at a speed a little faster than the second one? Or would this give difficulties to combine the 2 parts to have 1 output?
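
To make the rolled vs. unrolled distinction concrete, here is a minimal behavioural sketch in Python. It is not real SHA-256 and not anyone's actual miner code: `toy_round` is a made-up mixing step, and the four constants are just arbitrary values (they happen to be the first four SHA-256 round constants). The point is only that the rolled version reuses one copy of the round logic and needs a multiplexer to choose between fresh input and the fed-back partial result, while the unrolled version lets data flow straight through one stage per round.

```python
MASK = 0xFFFFFFFF

# Arbitrary per-round constants (first four SHA-256 K values, used only as sample data).
ROUND_CONSTANTS = [0x428A2F98, 0x71374491, 0xB5C0FBCF, 0xE9B5DBA5]


def toy_round(state: int, k: int) -> int:
    """Hypothetical 32-bit mixing step standing in for one hash round."""
    rotated = ((state >> 7) | (state << 25)) & MASK
    return ((rotated ^ k) + state) & MASK


def unrolled(x: int) -> int:
    """One copy of the round logic per round: data percolates from start to finish."""
    s0 = toy_round(x, ROUND_CONSTANTS[0])
    s1 = toy_round(s0, ROUND_CONSTANTS[1])
    s2 = toy_round(s1, ROUND_CONSTANTS[2])
    s3 = toy_round(s2, ROUND_CONSTANTS[3])
    return s3


def rolled(x: int):
    """A single round instance reused every clock, with a feedback multiplexer."""
    reg = 0
    clocks = 0
    for i, k in enumerate(ROUND_CONSTANTS):
        mux_out = x if i == 0 else reg  # mux: new input on the first clock, feedback afterwards
        reg = toy_round(mux_out, k)
        clocks += 1
    return reg, clocks


result, clocks = rolled(0x12345678)
print(result == unrolled(0x12345678), clocks)
```

Both versions compute the same value. The rolled one occupies its single round instance for as many clocks as there are rounds, and the `mux_out` line is the extra logic the quote refers to.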
lame.duck
Legendary
*
Offline Offline

Activity: 1242


View Profile
April 11, 2012, 09:14:50 AM
 #42

Unfortunately, those numbers have nothing to do with a real working hasher, as a single hashing core needs 1/3 more LEs, and I wonder whether he can get the LE count back to the announced 1250 LEs. In fact, in another run I turned most area optimisations on and got only slightly better results. Besides that, his design has no communication module, and the control logic seems incomplete to me, as there is no logic to distribute the different nonces to the hashing cores.

Btw., as far as I know the makomk design aims at the C7 grade device, and it would be worth testing what speed is possible with a C6 grade device. At least for the EP3C25 C7 grade device I got a bitstream reaching 117 MHz, which should be sufficient to run at the targeted 120 MHz (= 30 MHash/s).
fpgaminer
Hero Member
*****
Offline Offline

Activity: 546



View Profile WWW
April 11, 2012, 10:27:41 AM
 #43

Quote
So if you can fit one fully unrolled instance in a device so you can get 1 hash/clock and fit another half unrolled instance that gets 1 hash/2 clocks, the second one holds back the speed of the first one.
Would it be possible to clock the first at a speed a little faster than the second one? Or would this give difficulties to combine the 2 parts to have 1 output?
The inefficiencies of a rolled hasher are (usually) in area consumption, not in timing performance.

But to answer your question, yes you can clock different hashers at different speeds. Async FIFOs are used to cross the clock domains.
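
As an illustration of why async FIFOs can cross clock domains safely: a common construction (assumed here, the post does not say which FIFO design is meant) passes the read and write pointers between domains in Gray code, so that only one bit changes per increment and a pointer sampled mid-transition is off by at most one position. A minimal Python sketch of the encoding, with hypothetical helper names:

```python
def bin_to_gray(n: int) -> int:
    """Convert a binary counter value to its Gray-code equivalent."""
    return n ^ (n >> 1)


def gray_to_bin(g: int) -> int:
    """Convert a Gray-code value back to plain binary."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


DEPTH = 16  # hypothetical FIFO depth; must be a power of two for the wrap-around property

for ptr in range(DEPTH):
    nxt = (ptr + 1) % DEPTH
    # Exactly one bit flips between consecutive pointer values, including the wrap.
    assert bits_changed(bin_to_gray(ptr), bin_to_gray(nxt)) == 1
    assert gray_to_bin(bin_to_gray(ptr)) == ptr
```

In hardware the Gray-coded pointer would additionally pass through a two-flop synchronizer in the receiving clock domain; that part is not modelled here.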

Jason
Member
**
Offline Offline

Activity: 114


View Profile
April 11, 2012, 02:03:27 PM
 #44

I too looked over Wondermine's code and I am skeptical that it will challenge either the Ztex code or Makomk's modifications of Fpgaminer's code in terms of MH/s.  Still, having said that, I wish him good luck as if he does manage it, we'll all benefit and learn something in the process.

BM-2D7sazxZugpTgqm3M2MCi5C1t8Du8BN11f
Inspector 2211
Sr. Member
****
Offline Offline

Activity: 383



View Profile
April 11, 2012, 02:53:51 PM
 #45

Quote
As far as I understand, loops that are not 100% unrolled incur inefficiencies, as partial results have to be fed back in and there has to be logic (multiplexers) to do that, whereas data in a fully unrolled design just percolates from start to finish.
Interesting.
So if you can fit one fully unrolled instance in a device so you can get 1 hash/clock and fit another half unrolled instance that gets 1 hash/2 clocks, the second one holds back the speed of the first one.
Would it be possible to clock the first at a speed a little faster than the second one? Or would this give difficulties to combine the 2 parts to have 1 output?

Yes, you can, and that would be a fallback strategy for the Cyclone V GX 7 in case one cannot fit two unrolled double-SHAs. However, it'll hurt the $ per MH/s number.
makomk
Hero Member
*****
Offline Offline

Activity: 686


View Profile
April 11, 2012, 10:10:39 PM
 #46

Quote
makomk achieved 27.7 MH/s from a Cyclone IV 22k part. My guess is that it is a 220 MHz core rolled 8 times. A fully unrolled core fits in a 75k Cyclone. That gives two cores on a 150k part, probably at 300 MHz (28 nm vs. 60 nm), so 600 MH/s may be possible...
110 MHz at 4 clock cycles per hash, actually. The design scales down reasonably well to smaller devices.
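
The relationship between clock rate, rolling factor and hash rate used here can be sketched as follows. This assumes that `LOOP_LOG2` is the base-2 logarithm of the number of clocks a core spends per hash (the thread only confirms that `LOOP_LOG2 = 0` corresponds to the fully unrolled case); the function name is made up for illustration.

```python
def hash_rate_mhs(fmax_mhz: float, loop_log2: int, cores: int = 1) -> float:
    """Estimated MH/s for `cores` identical hashers.

    Assumption: each core needs 2**loop_log2 clocks per hash,
    so loop_log2 = 0 means one hash per clock (fully unrolled).
    """
    clocks_per_hash = 1 << loop_log2
    return cores * fmax_mhz / clocks_per_hash


print(hash_rate_mhs(110, 2))     # 110 MHz at 4 clocks per hash
print(hash_rate_mhs(120, 2))     # the 120 MHz target mentioned earlier
print(hash_rate_mhs(109.84, 0))  # fully unrolled at the reported Fmax
```

With these inputs, 110 MHz at 4 clocks per hash comes out at 27.5 MH/s, close to the 27.7 MH/s quoted above, and 120 MHz gives 30 MH/s.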

Quote
Thanks for the reference.  Unfortunately, the URL to Makomk's code in the message you referenced does not exist, though perhaps his code is reflected by the DE2-115-makomk-mod branch of fpgaminer's code.  I just compiled that code with LOOP_LOG2 set to 0 and found that it compiles to 77,724 LEs/Fmax=109.84 MHz with the provided project settings, so probably 75K LEs can be achieved by optimizing for density (though this would reduce Fmax).
That's slightly older code. It's probably better than the newer versions with LOOP_LOG2=0 but it gives invalid results if you change it to anything else.

Quad XC6SLX150 Board: 860 MHash/s or so.
SIGS ABOUT BUTTERFLY LABS ARE PAID ADS
Inspector 2211
Sr. Member
****
Offline Offline

Activity: 383



View Profile
April 11, 2012, 11:09:58 PM
 #47

Quote
makomk achieved 27.7 MH/s from a Cyclone IV 22k part. My guess is that it is a 220 MHz core rolled 8 times. A fully unrolled core fits in a 75k Cyclone. That gives two cores on a 150k part, probably at 300 MHz (28 nm vs. 60 nm), so 600 MH/s may be possible...
110 MHz at 4 clock cycles per hash, actually. The design scales down reasonably well to smaller devices.

Quote
Thanks for the reference.  Unfortunately, the URL to Makomk's code in the message you referenced does not exist, though perhaps his code is reflected by the DE2-115-makomk-mod branch of fpgaminer's code.  I just compiled that code with LOOP_LOG2 set to 0 and found that it compiles to 77,724 LEs/Fmax=109.84 MHz with the provided project settings, so probably 75K LEs can be achieved by optimizing for density (though this would reduce Fmax).
That's slightly older code. It's probably better than the newer versions with LOOP_LOG2=0 but it gives invalid results if you change it to anything else.

So, can you make the latest code available somewhere?
I'm trying to instantiate two fully unrolled instances of your "slightly older code" on the Cyclone V GX 7 target architecture - so far, unsuccessfully.
One instance is placed and routed just fine, and achieves 140 MH/s in the conservative "slow" simulation and a whopping 250 MH/s in the optimistic "fast" simulation.
Jason
Member
**
Offline Offline

Activity: 114


View Profile
April 12, 2012, 12:00:05 AM
 #48

Quote
So, can you make the latest code available somewhere?
I'm trying to instantiate two fully unrolled instances of your "slightly older code" on the Cyclone V GX 7 target architecture - so far, unsuccessfully.
One instance is placed and routed just fine, and achieves 140 MH/s in the conservative "slow" simulation and a whopping 250 MH/s in the optimistic "fast" simulation.

I still think the fastest you're going to get will be just a bit over the slow simulation based on my experience with the Cyclone II and Cyclone IV.

Dunno if you're interested, but if you'd like to get some idea for what's possible with Altera Hardcopy and you have modified the code to allow you to compile multiple instances of Makomk's code, you might try targeting a Stratix IV, EP4SGX530HH35 which is a prototype for the Hardcopy HC4GX25 ASIC (530K LEs).  You could also try targeting one of the bigger Stratix V devices as a prototype for the Hardcopy V ASICs (up to 930K LEs) -- but I can't seem to find on the Altera website a list detailing which Stratix V is a prototype for which Hardcopy V like I can for the Hardcopy IV series.  Don't bother trying unless you have at least 8GB RAM on the machine you're using as the bigger FPGAs really use up a lot of memory during compilations.

I'm guessing somewhere in the 2.4 GH/s range is possible with the Hardcopy IV, and considerably more with the Hardcopy V, although cooling the die may be challenging.

By my back-of-the-envelope calculations, it's going to take around $0.75 million in capital to launch such a project -- including the setup fees for the hardcopy ($200K), 500 ASICs ($500K), and a little more for the design and production of 500 basic mining boards.  In order to beat BFL in terms of MH/$, it would almost certainly need to use the Hardcopy V ASIC which would conservatively give over 2 MH/$ performance if the ASIC could be adequately cooled.  At today's difficulty/valuation levels, each one of those boards should be able to mine around 2 bitcoins per day, or $300/month, with a projected payback period of 5 months -- not bad.  Any fearless investors out there?  ;)
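
The back-of-the-envelope figures above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic stated in the post: it assumes a 30-day month and ignores electricity, difficulty changes and exchange-rate movement, and the function names are made up for illustration.

```python
def payback_months(capital_usd: float, boards: int, revenue_usd_per_board_month: float) -> float:
    """Months until one board's mining revenue covers its share of the capital."""
    cost_per_board = capital_usd / boards
    return cost_per_board / revenue_usd_per_board_month


def implied_btc_price(revenue_usd_per_month: float, btc_per_day: float, days_per_month: int = 30) -> float:
    """USD/BTC rate implied by a monthly revenue and a daily BTC yield."""
    return revenue_usd_per_month / (btc_per_day * days_per_month)


print(payback_months(750_000, 500, 300))  # $0.75M over 500 boards, $300/month each
print(implied_btc_price(300, 2))          # 2 BTC/day worth $300/month
```

With the post's numbers, each board carries $1,500 of the capital, which $300/month pays back in 5 months, and the implied exchange rate is about $5 per bitcoin.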

BM-2D7sazxZugpTgqm3M2MCi5C1t8Du8BN11f
Inspector 2211
Sr. Member
****
Offline Offline

Activity: 383



View Profile
April 12, 2012, 12:12:58 AM
 #49

Nah, I cannot afford funding a Hardcopy device, and even if I barely could, I wouldn't invest all of my money into a Bitcoin miner. So many things can go wrong - difficulty can explode, the exchange rate can plummet, MtGox could be shut down by the Japanese government, etc. etc.

The LargeCoin folks up in Vancouver seem to be doing something like that, however.
Maybe even full-custom ASICs, judging from the low estimated power draw of their 20 GH/s box.
Jason
Member
**
Offline Offline

Activity: 114


View Profile
April 12, 2012, 02:28:52 AM
 #50

Quote
Nah, I cannot afford funding a Hardcopy device, and even if I barely could, I wouldn't invest all of my money into a Bitcoin miner. So many things can go wrong - difficulty can explode, the exchange rate can plummet, MtGox could be shut down by the Japanese government, etc. etc.

I agree.  The only way it makes sense is to spread the risk around.  Problem is finding 10 people to invest $75K, or even 100 people to invest $7.5K and then wait at least 6 months before seeing any return on their investments.

I am curious to see what level of performance can be achieved and will see how many miners will fit on a Stratix IV/V myself soon and report back my findings.  I think others have tried with some Xilinx products, but until they have a hardcopy equivalent, those results are less interesting.

Quote
The LargeCoin folks up in Vancouver seem to be doing something like that, however.
Maybe even full-custom ASICs, judging from the low estimated power draw of their 20 GH/s box.

I'm surprised someone would even attempt to do that for something as (relatively) obscure as LargeCoin.

BM-2D7sazxZugpTgqm3M2MCi5C1t8Du8BN11f
ElectricMucus
Legendary
*
Offline Offline

Activity: 1540


Drama Junkie


View Profile
April 12, 2012, 09:51:05 PM
 #51

Isn't the whole unrolling discussion really quite naive?

The optimal solution will be, depending on the layout of the chip, partly unrolled and partly not, depending on where inside the chip the operation takes place. It isn't really productive to even talk about unrolling as such, since the best approach depends on how many elements can be used for memory, logic and routing.
Of course, computing a solution that fits this paradigm would probably be too much; however, if the tools permit programming the FPGA at a low level, it should at least be possible to use close to 100% of all resources for something useful.

Quote
As far as I understand, loops that are not 100% unrolled incur inefficiencies, as partial results have to be fed back in and there has to be logic (multiplexers) to do that, whereas data in a fully unrolled design just percolates from start to finish.

Not if the looping is free. If there are abundant local resources at a specific place because of chip layout constraints, the penalty for looping and memory would decrease up to the point where rolling would be more efficient. But to be honest, I have no idea how this should be implemented and don't even know if it is at all possible with available tools.
Nevertheless, I think it is worth a thought, and if FPGAs prevail over a long period as the status quo, something will come up. I am certain of that.

Another speculative note: the I/O ports of the FPGA might eventually be used to obtain additional routing. This would of course restrict the layout, but it should be worth it for cases where a slow data stream would otherwise consume unnecessary resources if routed over long distances.

First they ignore you, then they laugh at you, then they keep laughing, then they start choking on their laughter, and then they go and catch their breath. Then they start laughing even more.