Bitcoin Forum
February 26, 2017, 07:22:21 PM *
Author Topic: Cyclone V now shipping!  (Read 13058 times)
lame.duck
Legendary
*
Offline Offline

Activity: 1246


View Profile
April 06, 2012, 10:49:53 AM
 #21

Just out of curiosity, I compiled a single DE2_115_makomk_mod with Quartus 11.1 (no SP) for the smallest Cyclone V GX, speed grade 8. The report said 40% device usage, but the reported Fmax was quite low, under 100 MHz. Maybe there is a bug in the timing analyzer, since the reported Fmax was a little higher at 85°C. I think I will try it again with SP2 applied, even if the 'Not in stock' will last for a while (I guess).
Dexter770221
Legendary
*
Offline Offline

Activity: 1028


View Profile
April 06, 2012, 11:41:31 AM
 #22

Smallest you can choose from, or smallest in the Cyclone V family? I can only choose from different variants of the 5CEA7 part.
40% is what I calculated in my head just by looking at the ALM structure ;) So, two cores are possible. But that clock is very low :( Maybe adding some flip-flops (registers) at the ALM outputs will help a little... Since it's a pipelined design, the extra registers add latency but will not hurt throughput.

Under development Modular UPGRADEABLE Miner (MUM). Looking for investors.
Changing one PCB with screwdriver and you have brand new miner in hand... Plug&Play, scalable from one module to thousands.
Jason
Member
**
Offline Offline

Activity: 114


View Profile
April 06, 2012, 01:31:08 PM
 #23

Quote
Cyclone V ALMs have a different structure than Cyclone IV. In one ALM you have four 4-input LUTs + two 1-bit adders (with dedicated fast carry chains). So theoretically it should be possible to put one stage into 160 ALMs (if I'm calculating right). 160*128=20480 ALMs for one fully unrolled core. The 5CEA7 has over 56k ALMs; that's enough for two cores + some other logic for receiving, transmitting and distributing work.

Nice.  Looks like the C5 is a bigger step up over the C4 than the C4 was over the C2.  I'm downloading 11.1, 11.1SP1, and 11.1SP2 now so I can try out some builds myself, but it looks like you're right about fitting two fully unrolled loops in there -- the question now is what kind of Fmax can be achieved?
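
For reference, the ALM budget quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the 160 ALMs per stage is the quoted poster's own estimate, 56,000 stands in for "over 56k", and reading 128 as 2 SHA-256 passes of 64 rounds each is my interpretation. The variable names are made up for illustration.

```python
# Back-of-the-envelope ALM budget, using the numbers quoted in the thread.
ALMS_PER_STAGE = 160      # poster's estimate for one SHA-256 round (assumption)
STAGES_PER_CORE = 128     # read here as 2 SHA-256 passes x 64 rounds
DEVICE_ALMS = 56_000      # "over 56k ALMs" on the 5CEA7, rounded down

alms_per_core = ALMS_PER_STAGE * STAGES_PER_CORE
cores_that_fit = DEVICE_ALMS // alms_per_core
alms_left = DEVICE_ALMS - cores_that_fit * alms_per_core
one_core_share = alms_per_core / DEVICE_ALMS

print(f"ALMs per fully unrolled core: {alms_per_core}")
print(f"Cores that fit by ALM count alone: {cores_that_fit}")
print(f"ALMs left for I/O and work distribution: {alms_left}")
print(f"Share of the device used by one core: {one_core_share:.1%}")
```

This counts ALMs only; it says nothing about routing congestion or whether the fitter can actually place two cores.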

BM-2D7sazxZugpTgqm3M2MCi5C1t8Du8BN11f
Inspector 2211
Sr. Member
****
Offline Offline

Activity: 383



View Profile
April 09, 2012, 03:21:42 AM
 #24

Quote
Just out of curiosity, I compiled a single DE2_115_makomk_mod with Quartus 11.1 (no SP) for the smallest Cyclone V GX, speed grade 8. The report said 40% device usage, but the reported Fmax was quite low, under 100 MHz. Maybe there is a bug in the timing analyzer, since the reported Fmax was a little higher at 85°C. I think I will try it again with SP2 applied, even if the 'Not in stock' will last for a while (I guess).

I tried that also and got 97 or 98 MHz Fmax, but the signal this pertains to is altera_reserved_tck, which (as far as I understand) is the JTAG clock, not the system clock.
In fact, I set the clock period of the system clock "osc_clk" to 8 ns in a constraint file, and the design compiled without error.
When I set it to 4 ns (more as a sanity check than as a serious attempt to achieve 250 MH/s), TimeQuest flagged errors.
6 ns ... stay tuned.
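
For anyone following along, here is a small sketch of how the constraint periods map to clock frequencies. The function name is made up for illustration and nothing here comes from Quartus or TimeQuest. With a fully unrolled miner (one hash per clock), the MHz figure is also the MH/s figure.

```python
def period_ns_to_mhz(period_ns: float) -> float:
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns

# Periods tried so far in this post.
for period_ns in (8.0, 6.0, 4.0):
    print(f"{period_ns:.1f} ns -> {period_ns_to_mhz(period_ns):.1f} MHz")
```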
Dexter770221
Legendary
*
Offline Offline

Activity: 1028


View Profile
April 09, 2012, 10:03:30 AM
 #25

So, the poor JTAG clock was the limitation? Good to know. We have to remember that this code is over 6 months old, from when the Spartan hit 90 MH/s ;) Makomk achieved 27.5 MH/s on the DE0-Nano; with the code that you're trying he only got a little above 13 MH/s.

lame.duck
Legendary
*
Offline Offline

Activity: 1246


View Profile
April 09, 2012, 10:59:44 AM
 #26

No, the JTAG clock isn't the limiting factor; I had only set the PLL to 120 MHz, which seems insufficient to get optimal synthesis results.
I was compiling for the smallest available part offered by Quartus, not the smallest Cyclone V (I mistakenly mixed it up with the Xilinx Artix-7, where all sub-100k parts were dropped). By the way, how did you select a Cyclone V E device? I could only choose a GX part, but the number of ALUTs seems to be the same.
Inspector 2211
Sr. Member
****
Offline Offline

Activity: 383



View Profile
April 09, 2012, 01:58:14 PM
 #27

Quote
No, the JTAG clock isn't the limiting factor; I had only set the PLL to 120 MHz, which seems insufficient to get optimal synthesis results.
I was compiling for the smallest available part offered by Quartus, not the smallest Cyclone V (I mistakenly mixed it up with the Xilinx Artix-7, where all sub-100k parts were dropped). By the way, how did you select a Cyclone V E device? I could only choose a GX part, but the number of ALUTs seems to be the same.


5CGXBC7C6F23C7

6 ns ... passes "fast 1100mV 0C model", fails both slow models
4 ns ... passes "fast 1100mV 0C model", fails both slow models
3 ns ... fails all 3 models
7 ns ... running as we speak
Jason
Member
**
Offline Offline

Activity: 114


View Profile
April 09, 2012, 02:19:30 PM
 #28

Quote
In fact, I set the clock period of the system clock "osc_clk" to 8 ns in a constraint file, and the design compiled without error.
When I set it to 4 ns (more as a sanity check than as a serious attempt to achieve 250 MH/s), TimeQuest flagged errors.
6 ns ... stay tuned.

Strange, Quartus ignores the jtag clock for me.  You might try setting a false path for the jtag clock in your constraints file in order to tell the system to ignore the jtag clock.  It should report Fmax for the main clock under the slow timing model section of the timing analyzer output.

So you were able to hit 125MHz so far -- that's pretty good and suggests that the Cyclone V can hit at least 250MH/s.  It will need to hit at least 300MH/s to make it competitive with the LX150 on a MH/$ basis, but it does seem at least possible that the Cyclone V may displace the LX150 as the new MH/$ leader.

A mining board with 2 Cyclone Vs can presumably be produced for about the same cost as a BFL single, but at 600MH/s, it's still less than three quarters of the BFL's mining rate.  I wonder if the reduced power consumption (should use less than one quarter of what the BFL uses) would entice many people to buy such a board in lieu of BFL's offering?  I'm guessing not, but I'm sure that won't stop someone from making them.  Maybe this is what ngzhang has up his sleeve for his Icarus replacement?

Inspector 2211
Sr. Member
****
Offline Offline

Activity: 383



View Profile
April 09, 2012, 02:56:11 PM
 #29

7 ns also failed in both slow models, but only by a small margin. 7.3 ns should work (running now, but I have to drive to work now).

IMHO, these timing simulations are not 100% accurate - the real-life error rate you get with an actual clock, that's where the rubber meets the road.
The fast model works up to 4 ns (250 MHz) - that's very promising and maybe that's what you get in the real world.
Or maybe not.
We'll never know until someone builds an actual board.
Jason
Member
**
Offline Offline

Activity: 114


View Profile
April 09, 2012, 03:36:49 PM
 #30

Quote
IMHO, these timing simulations are not 100% accurate - the real-life error rate you get with an actual clock, that's where the rubber meets the road.
The fast model works up to 4 ns (250 MHz) - that's very promising and maybe that's what you get in the real world.
Or maybe not.
We'll never know until someone builds an actual board.

We can probably get an idea from how accurate the simulations are on older Cyclones.  For example, with the Cyclone IV on my DE2-115 dev board, I find that I can go only about 5-10MHz over the Fmax reported by the 85C Slow Model before I start seeing a few invalid blocks being reported by the pool I'm using.  That's with a small 23mm heat sink on the FPGA with a fan blowing on it.  According to my IR thermometer, the heat sink is under 40C.

You might try bringing up the timing advisor (Tools->Advisors->Timing Advisor) and changing settings to match some of the recommendations it makes if you have not already done so.  You might pick up a few tens or hundreds of picoseconds of slack that way.  Another thing worth trying if you have some patience and spare compute cycles is to run the Design Space Explorer on the design and see what it can come up with.  Make sure you do a test run with it first before you let it run for days on end so you don't wind up wasting your time like I have!

fpgaminer
Hero Member
*****
Offline Offline

Activity: 546



View Profile WWW
April 10, 2012, 01:23:38 AM
 #31

Quote
7 ns also failed in both slow models, but only by a small margin. 7.3 ns should work (running now, but I have to drive to work now).
Is it still only failing on the JTAG clock or is it also failing on the hashing clock?

The code in my repo for Altera targets is based around my virtual_wire module, which isn't really the best (but it's easy to use on Altera dev boards where you already have a USB-Blaster).

You could try the DE2_115_makomk_serial project put together by teknohog, which uses a UART core. I'm designing a newer UART core with more functionality, etc, but that's not done yet and the one by teknohog is perfectly sufficient. You just need to make sure the makomk code in there is up-to-date (I haven't checked yet).

Inspector 2211
Sr. Member
****
Offline Offline

Activity: 383



View Profile
April 10, 2012, 04:03:52 AM
 #32

Quote
Is it still only failing on the JTAG clock or is it also failing on the hashing clock?

It never "failed" on the JTAG clock, because I never put a time constraint on the JTAG clock.
At 7.3 ns clock cycle, the design passes.
Fmax is quoted as 139 MHz and 141 MHz for Slow 0C and Slow 85C, respectively.
(Yes, for some reason the Cyclone V is expected to run slightly faster at a higher temperature.)

I have not tried to fit two instances of the miner yet.

If you have the time, please elucidate the difference between LOOP_LOG2=0 and LOOP_LOG2=1.
I can't wrap my head around that - both of them are supposed to be fully unrolled ?!?
Which one is the "preferred", "believed to be faster" setting?

(I do understand that a setting of 2 means that there is only one SHA-256 instance and its output has to be fed back in front.
No need to set LOOP_LOG2 to 2 or higher on the Cyclone V.)

Quote
You could try the DE2_115_makomk_serial project put together by teknohog, which uses a UART core. I'm designing a newer UART core with more functionality, etc, but that's not done yet and the one by teknohog is perfectly sufficient. You just need to make sure the makomk code in there is up-to-date (I haven't checked yet).

Duly noted, but convenient I/O is not my main focus now - rather, I now want to focus on getting two instances fully placed and routed and timed.
fpgaminer
Hero Member
*****
Offline Offline

Activity: 546



View Profile WWW
April 10, 2012, 04:32:05 AM
 #33

Quote
If you have the time, please elucidate the difference between LOOP_LOG2=0 and LOOP_LOG2=1.
I can't wrap my head around that - both of them are supposed to be fully unrolled ?!?
Your confusion may arise from some typos I made in the comments of my code awhile back. I apologize for that (and hope most of those typos have been fixed).

The LOOP_LOG2 parameter determines how many times the entire Bitcoin SHA-256 pipeline is folded in half. Each folding cuts performance in half, but also cuts resource consumption in half (1).

  • LOOP_LOG2=0 -> Fully unrolled, one hash per clock cycle.
  • LOOP_LOG2=1 -> Half unrolled, one hash per 2 clock cycles.
  • LOOP_LOG2=2 -> Quarter unrolled, one hash per 4 clock cycles.
  • ...etc...

Quote
Duly noted, but convenient I/O is not my main focus now - rather, I now want to focus on getting two instances fully placed and routed and timed.
Mmmkay, I was mentioning it mostly because it probably has better timing and uses far fewer resources (those virtual_wires are nothing to sneeze at).

(1) It should be noted that LOOP_LOG2=0, fully unrolled, has special advantages over the other settings due to constant optimization, dropping the last three rounds, etc. So if you were to graph LOOP_LOG2 vs. area consumption it would be linear from 1 onward, but not from 0 to 1.
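
To make the folding rule concrete, here is a minimal Python sketch of the throughput side only. The function names are made up for illustration and are not part of the miner code. It does not model area, which per note (1) is not linear between 0 and 1.

```python
def cycles_per_hash(loop_log2: int) -> int:
    """Each fold doubles the clock cycles needed per Bitcoin hash."""
    return 1 << loop_log2

def hash_rate_mhs(clock_mhz: float, loop_log2: int) -> float:
    """Hash rate in MH/s for one miner instance at the given clock."""
    return clock_mhz / cycles_per_hash(loop_log2)

# Example: the same 140 MHz clock at three folding levels.
for loop_log2 in (0, 1, 2):
    rate = hash_rate_mhs(140.0, loop_log2)
    print(f"LOOP_LOG2={loop_log2}: {cycles_per_hash(loop_log2)} cycle(s)/hash, {rate:.1f} MH/s")
```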

Jason
Member
**
Offline Offline

Activity: 114


View Profile
April 10, 2012, 04:34:53 AM
 #34

Quote
I have not tried to fit two instances of the miner yet.

That has the potential to reduce the Fmax you can achieve, although if one fully unrolled miner uses only 40% of the device, the effect should be small.

Quote
If you have the time, please elucidate the difference between LOOP_LOG2=0 and LOOP_LOG2=1.
I can't wrap my head around that - both of them are supposed to be fully unrolled ?!?
Which one is the "preferred", "believed to be faster" setting?

LOOP_LOG2=0 is a fully unrolled miner -- 2 full SHA-256 instances.  It achieves 1 clock cycle per bitcoin hash.  LOOP_LOG2=1 divides the work of each SHA-256 hasher in two so that they take 2 clock cycles per bitcoin hash but use significantly fewer LEs.  You want to use LOOP_LOG2=0 whenever you can for best performance.

Quote
(I do understand that a setting of 2 means that there is only one SHA-256 instance and its output has to be fed back in front.
No need to set LOOP_LOG2 to 2 or higher on the Cyclone V.)

LOOP_LOG2 does not affect the number of SHA-256 instances.  There are always two of them.  It affects the amount of unrolling that is present in each of the SHA-256 instances.

LOOP_LOG2=0:  fully unrolled (2 fully unrolled SHA-256 hashers)
LOOP_LOG2=1:  partially unrolled (2 clock cycles per output)
LOOP_LOG2=2:  partially unrolled (4 clock cycles per output)
etc.
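
In software terms, the "two SHA-256 instances" correspond to the two passes of Bitcoin's double SHA-256. A minimal sketch using Python's standard hashlib; the function name bitcoin_hash is made up for illustration, and the input is a placeholder rather than a real 80-byte block header. It only shows the data flow, not how the FPGA pipelines the rounds.

```python
import hashlib

def bitcoin_hash(data: bytes) -> bytes:
    """Double SHA-256: the second pass hashes the 32-byte digest of the first."""
    first_pass = hashlib.sha256(data).digest()      # role of hasher 1
    return hashlib.sha256(first_pass).digest()      # role of hasher 2

# Placeholder input; a real miner would feed a block header with a varying nonce.
digest = bitcoin_hash(b"example header bytes")
print(digest.hex())
```

In the FPGA the two hashers sit back to back in one pipeline, so LOOP_LOG2 changes how much of each is unrolled, not how many there are.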

Quote
Duly noted, but convenient I/O is not my main focus now - rather, I now want to focus on getting two instances fully placed and routed and timed.
Makes sense.  Who knows what form the I/O will take anyway until someone has designed/built some hardware around the Cyclone V.

You should be able to put two instances on the chip fairly easily by creating a new top-level entity and instantiating two fpgaminer instances.  You'll probably have to parameterize the virtual_wire instance IDs in order to avoid collisions, but that may only be a problem at runtime so you might also be able to ignore it.

ElectricMucus
Legendary
*
Offline Offline

Activity: 1540


Drama Junkie


View Profile
April 11, 2012, 12:51:26 AM
 #35

Isn't the whole unrolling discussion really quite naive?

The optimal solution will be, depending on the layout of the chip, partly unrolled partly not, depending on which place inside the chip the operation takes place. It isn't really proficient to even talk about unrolling since depending on the amount of elements which can be used for memory, logic and routing a particular approach will be best.
Of course computing a solution which fits this paradigm would probably be too much, however if the tools permit programming the FPGA on a low level it should be at least possible to use close to 100% of all resources for something useful.

First they ignore you, then they laugh at you, then they keep laughing, then they start choking on their laughter, and then they go and catch their breath. Then they start laughing even more.
Inspector 2211
Sr. Member
****
Offline Offline

Activity: 383



View Profile
April 11, 2012, 03:57:48 AM
 #36

OK, after 15 1/2 hours Quartus failed to fit two instances of the double-SHA, albeit at the "optimize for speed" setting.
I have now restarted it with the "optimize for space" setting. The tension is almost unbearable...
Inspector 2211
Sr. Member
****
Offline Offline

Activity: 383



View Profile
April 11, 2012, 04:03:55 AM
 #37

Quote
Isn't the whole unrolling discussion really quite naive?

The optimal solution will be, depending on the layout of the chip, partly unrolled partly not, depending on which place inside the chip the operation takes place. It isn't really proficient to even talk about unrolling since depending on the amount of elements which can be used for memory, logic and routing a particular approach will be best.
Of course computing a solution which fits this paradigm would probably be too much, however if the tools permit programming the FPGA on a low level it should be at least possible to use close to 100% of all resources for something useful.

As far as I understand, loops that are not 100% unrolled incur inefficiencies, as partial results have to be fed back in and there has to be logic (multiplexers) to do that, whereas data in a fully unrolled design just percolates from start to finish.
Jaryu
Member
**
Offline Offline

Activity: 91


View Profile
April 11, 2012, 04:26:29 AM
 #38

so basically what would be the... rough estimation of performance in MH/s for a chip if you can get the 2 instances fully loaded into it?
Inspector 2211
Sr. Member
****
Offline Offline

Activity: 383



View Profile
April 11, 2012, 04:42:40 AM
 #39

Quote
so basically what would be the... rough estimation of performance in MH/s for a chip if you can get the 2 instances fully loaded into it?

Impossible to say at this point, because
1) two instances don't even seem to fit
2) the difference between the slow estimate at 140 MH/s/instance and the fast estimate at 250 MH/s/instance is just too great.

The maximum seems to be 500 MH/s for two instances, but that's subject to too many assumptions to be realistic.
But then again, maybe wondermine comes up with a better implementation than fpgaminer.
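
The spread between those two estimates, as a quick Python sketch. It assumes two instances would actually fit and both would run at the single-instance clock, neither of which is established yet. The constant names are made up for illustration.

```python
# Per-instance estimates from the timing models (fully unrolled: MHz == MH/s).
INSTANCES = 2
SLOW_MODEL_MHS = 140   # roughly the slow-model Fmax for one instance
FAST_MODEL_MHS = 250   # 4 ns period, passed only in the fast model

low_estimate = INSTANCES * SLOW_MODEL_MHS
high_estimate = INSTANCES * FAST_MODEL_MHS

print(f"Per chip: {low_estimate} to {high_estimate} MH/s")
```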
Dexter770221
Legendary
*
Offline Offline

Activity: 1028


View Profile
April 11, 2012, 05:55:58 AM
 #40

wondermine has released new code. See for yourself what he achieved:
https://bitcointalk.org/index.php?topic=68352.msg844304#msg844304
