mnemonix
|
 |
April 20, 2013, 08:31:59 AM |
|
Thx for your work you put in the miner!
I ported the Xilinx_VHDL miner to the ml605 dev board.
Actually, it was straightforward ... I replaced the DCM with a newer Virtex-6 equivalent, wired the pins to RS232 and clock, adjusted the baud rate and it ran instantly.
It does 200 MHash/sec and the device is about 85% used ...
|
|
|
|
kramble
|
 |
April 20, 2013, 08:52:43 AM |
|
NO!! Don't confuse gates with LEs (logic elements). Older FPGAs often quoted a gate count (such as the Spartan 3E 250K gates you linked to). Newer FPGAs use a Logic Element (or Logic Cell) count (and Google tells me there are 12 gates to an LE). So a Spartan 6 LX150 with 147,443 logic cells roughly equates to 1.7 million gates by my calculation (I can't find any direct quote for the actual figure, so take that as very approximate). You can see the Spartan family spec at http://www.xilinx.com/support/documentation/data_sheets/ds160.pdf
The board you linked to will be (almost) useless for mining. You need to look for a purpose-built Spartan LX150 based miner and use the firmware (bitstream) that comes with it (and even then the economics look pretty grim). If you want to compile your own bitstream for the Spartan series, you can download free software from the Xilinx web site http://www.xilinx.com/products/design-tools/ise-design-suite/ise-webpack.htm but beware that it is limited to the smaller devices (LX75 maximum I think, but do your own due diligence). You need the full (very expensive) version to compile for the LX150.
Regards Mark
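The gate estimate above is plain multiplication, and a minimal sketch makes the arithmetic explicit. The 12 gates per LE figure is the rough number quoted in the post, not an official Xilinx conversion, and the function name is made up for illustration.

```python
# Rough gate-count estimate from a logic cell count.
# ASSUMPTION: 12 gates per logic element, the approximate figure quoted
# in the post. This is not an official vendor conversion factor.
GATES_PER_LE = 12


def approx_gates(logic_cells: int, gates_per_le: int = GATES_PER_LE) -> int:
    """Return a very approximate equivalent gate count."""
    return logic_cells * gates_per_le


if __name__ == "__main__":
    lx150_cells = 147_443  # Spartan 6 LX150 logic cells, from the post
    gates = approx_gates(lx150_cells)
    print(f"{gates:,} gates, about {gates / 1e6:.2f} million")
```

The result lands a little under 1.8 million, in line with the "roughly 1.7 million" quoted, and it is only as good as the 12:1 assumption.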
|
|
|
|
AJRGale
|
 |
April 20, 2013, 10:02:25 AM |
|
NO!! Don't confuse gates with LEs (logic elements). Older FPGAs often quoted a gate count (such as the Spartan 3E 250K gates you linked to). Newer FPGAs use a Logic Element (or Logic Cell) count (and Google tells me there are 12 gates to an LE). So a Spartan 6 LX150 with 147,443 logic cells roughly equates to 1.7 million gates by my calculation (I can't find any direct quote for the actual figure, so take that as very approximate). You can see the Spartan family spec at http://www.xilinx.com/support/documentation/data_sheets/ds160.pdf The board you linked to will be (almost) useless for mining. You need to look for a purpose-built Spartan LX150 based miner and use the firmware (bitstream) that comes with it (and even then the economics look pretty grim). If you want to compile your own bitstream for the Spartan series, you can download free software from the Xilinx web site http://www.xilinx.com/products/design-tools/ise-design-suite/ise-webpack.htm but beware that it is limited to the smaller devices (LX75 maximum I think, but do your own due diligence). You need the full (very expensive) version to compile for the LX150. Regards Mark
Ah, sorry for my newbishness, I've never played with one of these devices (blame the 2 companies for their heavy secretive efforts unless you buy their $5000 suite). My mistake. So when a company quotes a "gates" number, I have to look for ALM, LE, slice etc. instead? Basically I want to know what a full miner roll out fits on, i.e. how many LEs. Then I'll go to Digi-Key, look something up and go from there.
|
|
|
|
minernb
|
 |
April 20, 2013, 10:56:36 PM |
|
Basically I want to know what a full miner roll out fits on, i.e. how many LEs. Then I'll go to Digi-Key, look something up and go from there.
Hi, the Altera DE1 has 18K LE. The non-optimized version fits using a factor of 4 in the roll(?), for a total of 16K LE used. I get 3.10 MH/s. The makomk_mod version fits using factor 2 (but all work is rejected, I don't know why!). It reports 12 MH/s.
|
|
|
|
kramble
|
 |
April 21, 2013, 08:30:24 AM Last edit: April 21, 2013, 08:49:39 AM by kramble |
|
The makomk_mod version fits using factor 2 (but all work is rejected, I don't know why!). It reports 12 MH/s.
I had the same problem with the DE0-Nano (22k LE). This was Makomk's response ...
I've now started looking at the code in the DE2_115_makomk_mod branch, but I've hit a problem. The code compiles fine at CONFIG_LOOP_LOG2=2, 3 and 4 but it's producing the wrong hashes (I'm just running at 40MHz for testing, not full blast) ... the mine.tcl script submits hashes to the pool, but they are all rejected!
Yeah, that branch doesn't work with CONFIG_LOOP_LOG2!=1. You probably want http://www.makomk.com/gitweb/?p=Open-Source-FPGA-Bitcoin-Miner.git;a=summary de0-nano-hax branch, projects/DE2_115_Unoptimized_Pipelined project. The voltage regulators are also indeed horribly inefficient on the DE0-nano.
I can't answer AJRGale's query about the LEs needed for a fully unrolled core, as I haven't built anything larger than a one-sixth core, which (just) fitted into 22k LE on an EP4CE22 on the Nano.
Regards Mark
|
|
|
|
senseless
|
 |
April 21, 2013, 10:40:32 AM Last edit: April 21, 2013, 11:07:47 AM by senseless |
|
This is a DSP48E1 based design, and I have compiled and run it at 400MH/s.
Have you done any testing as to which adders provide the best increase to the fmax? In order to get multiple cores in there, I'm going to need to pick and choose which adders to replace with DSPs and which not to. I'm currently at 66% LUT usage with 99% memory LUT and 108% DSP usage with 2 unrolled cores (I had one core do even nonces while the other does odd nonces to make life easy). I've been slowly working down the number of DSPs utilized per core to make it fit. I'm thinking it might be possible to get 3 full cores on the A7 200. Does the DSP performance increase compound? If I change one adder over to DSP utilization and it gives a 10% fmax increase... would changing additional adders down the chain affect that 10%? Or will that one adder always give a 10% boost? I'm wondering if it will be possible to go through the adders one by one and calculate the increase in frequency for each one, to find which adders would be most effectively implemented in DSP48 blocks to get the best timing.
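The even/odd nonce split mentioned above generalizes to any number of cores: core k takes nonces k, k+N, k+2N and so on. A minimal software sketch of that partitioning follows; the function name and parameters are illustrative only and are not taken from the miner's HDL.

```python
# Interleaved nonce partitioning across cores.
# ASSUMPTION: nonce_stream and its parameters are hypothetical names used
# for illustration; the real design does this in hardware.
NONCE_SPACE = 2**32  # Bitcoin nonces are 32 bits wide


def nonce_stream(core_id: int, num_cores: int, limit: int = NONCE_SPACE) -> range:
    """Return the nonces handled by one core: core_id, core_id + num_cores, ..."""
    if not 0 <= core_id < num_cores:
        raise ValueError("core_id must be in [0, num_cores)")
    return range(core_id, limit, num_cores)


if __name__ == "__main__":
    # With two cores, core 0 gets the even nonces and core 1 the odd ones.
    print(list(nonce_stream(0, 2, limit=8)))
    print(list(nonce_stream(1, 2, limit=8)))
```

With this scheme the cores never test the same nonce, and together they cover the whole range without any coordination between them.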
|
|
|
|
anomalies
|
 |
April 22, 2013, 01:55:25 AM |
|
Hi, another question from a newb... Have any of you guys heard of Parallella? http://www.parallella.org What do you guys think about it?
|
|
|
|
AJRGale
|
 |
April 22, 2013, 04:21:40 AM |
|
Hi, another question from a newb... Have any of you guys heard of Parallella? http://www.parallella.org What do you guys think about it?
Ahh yes, that, my friend, is a completely different ball game to FPGA. I've been waiting for them to kick off; I want one to play with. 64 threads per chip... mmmm
|
|
|
|
paszczakojad
|
 |
April 24, 2013, 03:33:54 PM Last edit: April 24, 2013, 06:49:38 PM by paszczakojad |
|
This is a DSP48E1 based design, and I have compiled and run it at 400MH/s.
Have you done any testing as to which adders provide the best increase to the fmax? In order to get multiple cores in there, I'm going to need to pick and choose which adders to replace with DSPs and which not to. I'm currently at 66% LUT usage with 99% memory LUT and 108% DSP usage with 2 unrolled cores (I had one core do even nonces while the other does odd nonces to make life easy). I've been slowly working down the number of DSPs utilized per core to make it fit. I'm thinking it might be possible to get 3 full cores on the A7 200. Does the DSP performance increase compound? If I change one adder over to DSP utilization and it gives a 10% fmax increase... would changing additional adders down the chain affect that 10%? Or will that one adder always give a 10% boost? I'm wondering if it will be possible to go through the adders one by one and calculate the increase in frequency for each one, to find which adders would be most effectively implemented in DSP48 blocks to get the best timing.
I compiled fpgaminer's DSP code on the A7 200 and got 356 MHz on the -3 grade, 311 MHz on the -2 grade and 262 MHz on the -1. The -3 variant only exists in an extended temperature version, so it's much more expensive, which makes the -2 the best choice in my opinion. The usage was 20% slice logic, 34% slice logic distribution and 92% DSP. What were your results? I.e. what maximum clock do you get without DSPs?
Now I'm trying to replace some DSPs with the adder IP core. I think the best candidates are those that don't use the PCIN input (because they are simpler), like dsp_e, dsp_wp and dsp_t1p. When I replaced dsp_e with an adder I got 302 MHz (-2 version), 23% logic, 37% distribution, 75% DSP. Then I replaced dsp_wp: 271 MHz, 24% logic, 38% distribution, 63% DSP. Compilation took over 5 hours, while it takes 30 min when using only DSPs.
Then I replaced dsp_t1p and the compilation is taking ages (it hasn't completed yet). The estimate is that DSP usage will be 49%, so theoretically I should be able to fit two such cores. Even if I have to lower the clock to, say, 200 MHz, the total output would be 400 MH/s, which would be better than 311 MH/s with one DSP-only core.
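The trade-off described above can be checked with a couple of lines. This is a sketch under the assumption that a fully unrolled core produces one hash per clock, which is consistent with the 311 MHz and 311 MH/s figures in the post; the function names are made up for illustration.

```python
# Compare one fast core against several slower cores.
# ASSUMPTION: each fully unrolled core yields one hash per clock cycle,
# so MH/s equals MHz per core. Function names are hypothetical.


def total_mhs(clock_mhz: float, cores: int, clocks_per_hash: int = 1) -> float:
    """Aggregate hash rate in MH/s for identical cores at one clock."""
    return clock_mhz * cores / clocks_per_hash


def break_even_clock_mhz(single_core_mhz: float, cores: int) -> float:
    """Clock at which `cores` cores match one core at single_core_mhz."""
    return single_core_mhz / cores


if __name__ == "__main__":
    print(total_mhs(311, 1))             # one DSP-only core on the -2 grade
    print(total_mhs(200, 2))             # two mixed DSP/LUT cores
    print(break_even_clock_mhz(311, 2))  # two cores win above this clock
```

So two cores come out ahead as long as the mixed design still closes timing above about 155.5 MHz.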
|
|
|
|
fpgaminer (OP)
|
 |
April 25, 2013, 12:14:13 AM |
|
When I replaced dsp_e with adder I got 302 MHz
I find it odd that your Fmax is dropping when you replace the DSPs with LUTs. You may want to fiddle around with Vivado's settings to make sure register retiming (or whatever Vivado calls it) is enabled. Alternatively, implement the adders as two stages of 16 bits each, since the DSPs that are being replaced are two-stage (or three) anyway. Also, for dsp_t1p, it would be best to replace both dsp_t1p and compressor_t1p with a single LUT adder, since the LUT fabric can implement 3-way additions just as efficiently as 2-way additions.
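To illustrate the "two stages of 16 bits each" idea, here is a small software model of a 32-bit add split into a low half and a high half with a carry passed between them. It only models the arithmetic; in hardware a register would sit between the two stages, and the function name is a made-up one, not something from the miner's source.

```python
# Software model of a 32-bit adder split into two 16-bit stages.
# ASSUMPTION: add32_two_stage is a hypothetical name. The pipeline
# register between the stages is not modelled, only the arithmetic.
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def add32_two_stage(*operands: int) -> int:
    """Add 32-bit operands modulo 2**32 using two 16-bit stages."""
    # Stage 1: sum the low halves and keep whatever carries out of bit 15.
    low = sum(x & MASK16 for x in operands)
    carry = low >> 16
    # Stage 2: sum the high halves plus the carry from stage 1.
    high = sum((x >> 16) & MASK16 for x in operands) + carry
    return ((high & MASK16) << 16) | (low & MASK16)


if __name__ == "__main__":
    a, b, c = 0x89ABCDEF, 0x7FFFFFFF, 0x00010001
    assert add32_two_stage(a, b) == (a + b) & MASK32
    # The same split also works for a 3-way addition.
    assert add32_two_stage(a, b, c) == (a + b + c) & MASK32
    print(hex(add32_two_stage(a, b, c)))
```

Each stage only has a 16-bit carry chain to resolve, which is why splitting the adder this way can help timing at the cost of an extra cycle of latency.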
|
|
|
|
paszczakojad
|
 |
April 25, 2013, 05:10:48 AM |
|
When I replaced dsp_e with adder I got 302 MHz
I find it odd that your Fmax is dropping when you replace the DSPs with LUTs. You may want to fiddle around with Vivado's settings to make sure register retiming (or whatever Vivado calls it) is enabled. Alternatively, implement the adders as two stages of 16 bits each, since the DSPs that are being replaced are two-stage (or three) anyway.
I used 2-stage adders, because the DSP adders worked in 2 cycles and I didn't want to debug too much. The IP core generator recommended 3 cycles for the best performance, so I'll try that next. After replacing dsp_e, dsp_wp and dsp_t1p I got 46% DSPs used, so it's enough to fit two cores.
|
|
|
|
Khertan
|
 |
May 03, 2013, 06:30:39 PM |
|
I'm currently playing with the DE0 Nano code from Kramble.
And I have a question: you said that running it at a higher speed than 40MHz could damage an unmodified DE0 Nano, and I didn't understand why.
According to the Quartus PowerPlay Power Analyzer, the design at 50MHz uses only 328mW. That's around 273mA, right? It's supposed to support 500mA, isn't it?
Did I miss something?
|
|
|
|
kramble
|
 |
May 03, 2013, 08:32:43 PM Last edit: May 03, 2013, 09:12:41 PM by kramble |
|
I'm currently playing with the DE0 Nano code from Kramble.
And I have a question: you said that running it at a higher speed than 40MHz could damage an unmodified DE0 Nano, and I didn't understand why.
According to the Quartus PowerPlay Power Analyzer, the design at 50MHz uses only 328mW. That's around 273mA, right? It's supposed to support 500mA, isn't it?
Did I miss something?
No, I was just being conservative in case someone inexperienced just cranked it up to the max (and following the example of fpgaminer in his original readme). You can run it faster as long as you are happy the power supply will support it (I had a conversation with hardcore_fc a few months back about the regulators, it may be worth you looking back over it). I am currently running one board at 170MHz (with a hardwired external 1.2V core supply as described at www.makomk.com) and a second at 80MHz on a conventional 3.3V external supply.
You are correct that a USB supply will probably be limited to 500mA, but this is at 5 volts. I haven't played with the PowerPlay Analyzer, but I would expect that it is reporting the power at the 1.2V FPGA core rail. You have to account for the other devices on the DE0-Nano board too.
I just dug out some notes I made of measurements with the 3.3V supply. 40MHz was 0.48A, 80MHz 0.85A, 100MHz 1.0A, 120MHz 1.2A and 140MHz 1.36A, so roughly 10mA per MHz. The regulators were getting very hot at the higher speeds (even though I was pointing a fan at the board), hence my caution at running the DE0-Nano at these sorts of speeds. The regulators themselves are overtemperature protected, but looking at the datasheet, this only kicks in at T(junction) of 175C, while the max operating temperature is 125C. It also quotes 85C/Watt junction-ambient assuming a big chunk of PCB copper dedicated to heatsinking, so you can work out roughly what they can practically support.
Given the tiny returns from mining on the Nano, my opinion was that it's not worth risking the boards at the higher speeds. I'm happy with my current setup (as described above) as nothing is getting above 60C, but it's your call on your own stuff.
[EDIT] I should add that I'm using a serial interface to communicate with the boards, rather than the quartus_stp JTAG USB cable, which is why I can get away with a 3.3V external supply. If you are using the USB for communication, then an external 3.3V supply won't work as it will pull current from the USB instead (there are a couple of blocking diodes so no harm should occur). You could use a 5V external supply to supplement the USB's 500mA, but then it's all getting a bit Heath Robinson, and the onboard regulators are under more heat stress at 5V than 3.3V. Oh, and the DE0-Nano manual says the minimum external supply is 3.6V (I just happened to have 3.3V to hand and it worked fine, but it's technically out of spec so YMMV).
Regards Mark
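The numbers in this exchange can be worked through in a few lines. The sketch below uses only figures quoted in the posts (328mW at a 1.2V core rail, the five current readings at 3.3V, 85C/W and the 125C limit); the 25C ambient temperature and all function names are assumptions made for illustration.

```python
# Back-of-envelope power and thermal arithmetic for the figures above.
# ASSUMPTIONS: function names are hypothetical, and the 25C ambient is a
# guess. The measurements and datasheet figures are those quoted in the posts.


def current_a(power_w: float, volts: float) -> float:
    """Current drawn at a rail, from I = P / V."""
    return power_w / volts


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def max_dissipation_w(t_max_c: float, t_ambient_c: float, theta_ja: float) -> float:
    """Power a regulator can dissipate before reaching t_max_c."""
    return (t_max_c - t_ambient_c) / theta_ja


if __name__ == "__main__":
    # 328 mW at the 1.2 V core rail is about 273 mA at that rail only.
    print(f"core current: {current_a(0.328, 1.2) * 1000:.0f} mA")

    # Board current measured on a 3.3 V supply at various clocks.
    mhz = [40, 80, 100, 120, 140]
    amps = [0.48, 0.85, 1.0, 1.2, 1.36]
    slope, intercept = fit_line(mhz, amps)
    print(f"fit: {slope * 1000:.1f} mA/MHz plus {intercept * 1000:.0f} mA static")

    # Datasheet figures from the post, with an assumed 25 C ambient.
    print(f"max dissipation: {max_dissipation_w(125, 25, 85):.2f} W")
```

The fitted slope comes out a little under the "roughly 10mA per MHz" rule of thumb because part of the board current does not depend on the clock.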
|
|
|
|
fpgaminer (OP)
|
 |
May 03, 2013, 10:20:06 PM |
|
I've been asked a few times about a mining script for the current KC705 firmware. I wrote a plugin for Modular Python Bitcoin Miner. Here's the message I sent to someone about it:
I uploaded the custom MPBM module, which is compatible with the current KC705 mining code, here: https://mega.co.nz/#!Oh5HTDRB!C0RLYW4yZN8gbg38FfgLpzmKFcseOql3Xx1i_gXTfdM
You'll want to download a copy of MPBM's testing branch. Then extract the above archive into
Code:
modules/fpgamining
such that you end up with:
Code:
modules/fpgamining/kc705_uart/__init__.py
modules/fpgamining/kc705_uart/kc705uartworker.py
Once you start MPBM, you can add a KC705 Worker by opening the MPBM web interface ( http://127.0.0.1:8832 ) and clicking the "Workers" button on the left. On Windows, I ran MPBM under Cygwin, and the "Port" ended up being /dev/com2 for me. The baud rate is 115200.
~fpgaminer
I haven't had a chance to clean it up and put it on the repo yet.
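A quick way to confirm the archive was extracted to the right place is to check for the two files listed above. This is only a sketch: the helper name and the idea of passing the MPBM checkout directory as an argument are assumptions, and nothing here is part of MPBM itself.

```python
# Check that the KC705 worker files ended up where the post says they should.
# ASSUMPTION: missing_module_files is a hypothetical helper, not part of MPBM.
from pathlib import Path

EXPECTED_FILES = ("__init__.py", "kc705uartworker.py")


def missing_module_files(mpbm_root: str) -> list:
    """Return the expected module files that are absent under mpbm_root."""
    module_dir = Path(mpbm_root) / "modules" / "fpgamining" / "kc705_uart"
    return [name for name in EXPECTED_FILES if not (module_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_module_files(".")  # run from the MPBM checkout directory
    print("module looks installed" if not missing else f"missing: {missing}")
```

An empty list means both files are in place; anything else usually means the archive was extracted one directory level too high or too low.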
|
|
|
|
gingernuts
|
 |
May 04, 2013, 12:01:41 AM |
|
Looking at Digi-Key right now, for the chips you could actually buy today:
The small Kintex XC7K160T is $230-ish in -1 grade and $280-ish in -2 grade. The biggest Artix XC7A200T is $200-ish in -1 grade and $270-ish in -2 grade. Both of these can be developed with the free WebPACK software.
The Kintex used on the KC705, the XC7K325T, is $1000-ish in the -1 grade and $1500-odd in the -2 grade (they have a $1200 one, but not in stock), and needs a full Vivado/ISE license to play with. Even if I were to buy a KC705 dev kit, I can't see how the 325T device is going to be good bang for the buck...
Interestingly, in a Kintex -> Artix migration guide Xilinx seem to reckon that a -1 grade Kintex is 1.6x as fast as a -1 Artix, so while the 7A200T looks like a winner in terms of price and slices/DSP modules, I'm wondering whether the Kintex XC7K160T might not be the best value overall...
|
|
|
|
|
Khertan
|
 |
May 05, 2013, 04:16:55 PM Last edit: May 05, 2013, 07:54:52 PM by Khertan |
|
Given the tiny returns from mining on the Nano, my opinion was that it's not worth risking the boards at the higher speeds. I'm happy with my current setup (as described above) as nothing is getting above 60C, but it's your call on your own stuff.
Regards Mark
Thanks. Indeed, for bitcoin mining I won't risk burning my little Nano. I'm asking because I'm working on another project and I want to understand things so I don't burn it. I'll try to monitor the USB power used and the temperature. At 40MHz PowerPlay estimates 296mA ... for the FPGA only, of course. But I've played with the settings to reduce power usage compared with your original code / project settings. So it looks like PowerPlay underestimates power usage. Thanks a lot for your explanation.
|
|
|
|
xbaby
|
 |
May 07, 2013, 06:07:36 AM |
|
I'm trying to compile "projects/X6000_ztex_comm4" myself for the device "xc6slx150, speed -3" under Xilinx ISE v13.4, using the code from GitHub without any modification. With the default compile options from "xilinx_fpgaminer.xise" and the design goal "Timing Performance", placement failed. After changing the goal to "Minimum Runtime", the project compiled successfully, but the timing constraints can't be met. According to the PAR report, the clock speed is only 153MHz (period 6.54ns). Which optimization options do I need to use to achieve > 190MHz clock speed? Please help me, thanks very much.
+---------------------------+-------------+------------+------------+------------+------------+------------+------------+
|                           |   Period    |      Actual Period      |      Timing Errors      |     Paths Analyzed      |
| Constraint                | Requirement |------------+------------|------------+------------|------------+------------|
|                           |             |   Direct   | Derivative |   Direct   | Derivative |   Direct   | Derivative |
+---------------------------+-------------+------------+------------+------------+------------+------------+------------+
|TS_CLK_100MHZ              |     10.000ns|     9.689ns|    13.082ns|           0|         633|        1456|     3690036|
| TS_dynamic_clk_blk_clkfx  |      5.000ns|     6.541ns|         N/A|         633|           0|     3690036|           0|
+---------------------------+-------------+------------+------------+------------+------------+------------+------------+
Slice Logic Utilization:
  Number of Slice Registers: 84,129 out of 184,304 45%
    Number used as Flip Flops: 84,129
    Number used as Latches: 0
    Number used as Latch-thrus: 0
    Number used as AND/OR logics: 0
  Number of Slice LUTs: 50,798 out of 92,152 55%
    Number used as logic: 35,040 out of 92,152 38%
      Number using O6 output only: 15,507
      Number using O5 output only: 581
      Number using O5 and O6: 18,952
      Number used as ROM: 0
    Number used as Memory: 3,297 out of 21,680 15%
      Number used as Dual Port RAM: 0
      Number used as Single Port RAM: 0
      Number used as Shift Register: 3,297
        Number using O6 output only: 449
        Number using O5 output only: 0
        Number using O5 and O6: 2,848
    Number used exclusively as route-thrus: 12,461
      Number with same-slice register load: 12,036
      Number with same-slice carry load: 425
      Number with other load: 0
Slice Logic Distribution:
  Number of occupied Slices: 15,049 out of 23,038 65%
  Nummber of MUXCYs used: 22,144 out of 46,076 48%
  Number of LUT Flip Flop pairs used: 58,734
    Number with an unused Flip Flop: 959 out of 58,734 1%
    Number with an unused LUT: 7,936 out of 58,734 13%
    Number of fully used LUT-FF pairs: 49,839 out of 58,734 84%
    Number of slice register sites lost to control set restrictions: 0 out of 184,304 0%
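For reference, the relation between the periods in the timing report and the clock speeds mentioned is just f = 1/T. A minimal sketch of that conversion follows, using only numbers from the report above; the function names are made up for illustration.

```python
# Convert between clock period and frequency, and compute timing slack.
# ASSUMPTION: function names are hypothetical; the figures come from the
# PAR report quoted above.


def fmax_mhz(period_ns: float) -> float:
    """Maximum clock in MHz for a given minimum period in ns."""
    return 1000.0 / period_ns


def period_ns(freq_mhz: float) -> float:
    """Clock period in ns for a given frequency in MHz."""
    return 1000.0 / freq_mhz


def slack_ns(required_ns: float, actual_ns: float) -> float:
    """Positive slack means timing is met, negative means it is violated."""
    return required_ns - actual_ns


if __name__ == "__main__":
    print(f"achieved:  {fmax_mhz(6.541):.1f} MHz")          # TS_dynamic_clk_blk_clkfx
    print(f"requested: {fmax_mhz(5.000):.1f} MHz")          # 5 ns constraint
    print(f"slack:     {slack_ns(5.000, 6.541):.3f} ns")
    print(f"190 MHz needs a period under {period_ns(190):.3f} ns")
```

In other words, the critical path has to shrink by roughly 1.3ns to reach 190MHz, and by about 1.5ns to meet the 5ns constraint.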
|
|
|
|
Khertan
|
 |
May 07, 2013, 03:11:23 PM |
|
Given the tiny returns from mining on the Nano, my opinion was that it's not worth risking the boards at the higher speeds. I'm happy with my current setup (as described above) as nothing is getting above 60C, but it's your call on your own stuff.
Regards Mark
I've tried to fit a 2 loop with 32 hashers. This can fit in a DE0 Nano after some automagic Quartus area optimization, but with a much lower fmax (120MHz). It fits with only a few (1xx) LUTs free. Unfortunately I messed things up: while converting things to two loops I broke something in the cnt or feedback ...
|
|
|
|
kramble
|
 |
May 07, 2013, 04:24:50 PM Last edit: May 07, 2013, 05:38:43 PM by kramble |
|
I've tried to fit a 2 loop with 32 hashers. This can fit in a DE0 Nano after some automagic Quartus area optimization, but with a much lower fmax (120MHz). It fits with only a few (1xx) LUTs free. Unfortunately I messed things up: while converting things to two loops I broke something in the cnt or feedback ...
I was not able to get the LOOP_LOG2=2 code to fit myself, but makomk achieved 27.5MH/s on a Nano ( https://bitcointalk.org/index.php?topic=74749.msg847182#msg847182 (EDIT updated to a better link)), so I guess that with some expert tweaking it does indeed work.
I decided to go a different route and try to fit 22 hashers (which nicely gives 66 stages in three rounds, so just discarding the last two gives the 64 needed) using a variant of sha256_transform from makomk's GitHub (since the makomk branch in the official distribution does not work unless LOOP_LOG2=1). It did take a fair bit of tinkering in the simulator to get the timing right (and I ended up discarding makomk's pipelining of the K values since it was too confusing, so there is an opportunity for some further gain by putting it back in).
Interestingly, this 66 round core generalized quite well, as I was able to use it on an EP4CE10 as 6 rounds of 11 hashers and on an LX9 as 11 rounds of 6 hashers (rather disappointing utilization, but I'm even more of a novice at Xilinx ISE than I am at Quartus). Anyway, this was just playing around for the sake of it rather than a serious attempt to build a miner on these devices, though I did construct one of each using TQFP devices built on breakout adapters, which are currently hashing away at the majestic rates of 12.7MH/s 11.7MH/s (140MHz) and 5MH/s (110MHz) respectively.
Best of luck
Mark
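The 22 x 3, 11 x 6 and 6 x 11 arrangements described above all come from the same calculation: pick a hasher count, take enough passes to cover 64 stages, and discard the surplus. A small sketch of that bookkeeping follows; the function name is hypothetical and this is not code from the miner.

```python
# Fold 64 SHA-256 stages onto a smaller number of physical hasher stages.
# ASSUMPTION: fold_stages is a hypothetical helper that only reproduces the
# arithmetic in the post; it says nothing about how the HDL is organised.
import math

STAGES_NEEDED = 64  # stages required, as stated in the post


def fold_stages(hashers: int, stages_needed: int = STAGES_NEEDED):
    """Return (rounds, discarded) for a core built from `hashers` stages."""
    rounds = math.ceil(stages_needed / hashers)
    discarded = hashers * rounds - stages_needed
    return rounds, discarded


if __name__ == "__main__":
    for hashers in (22, 11, 6):
        rounds, discarded = fold_stages(hashers)
        print(f"{hashers} hashers x {rounds} rounds = {hashers * rounds} stages, "
              f"{discarded} discarded")
```

All three configurations land on 66 stages with two discarded, which matches the description of the same 66 round core being reused across the three devices.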
|
|
|
|
|