disclaimer201 (OP)
Legendary
Offline
Activity: 1526
Merit: 1001
|
|
August 04, 2012, 11:05:44 AM |
|
Hey guys, I know this doesn't belong in BTC mining, but the hardware section made the most sense imo.
Another one of my PSUs died recently, and I'm not sure whether it makes sense to rebuild the rig for BTC and/or LTC mining. If possible I'd love to switch to FPGA mining, but with ASICs on the horizon I'd only buy an FPGA that is able to mine Litecoins as well as Bitcoins (not at the same time, obviously).
Does anyone have plans for an FPGA with a memory cache for LTC mining?
|
|
|
|
2112
Legendary
Offline
Activity: 2128
Merit: 1073
|
|
August 04, 2012, 03:43:30 PM |
|
|
|
|
|
disclaimer201 (OP)
Legendary
Offline
Activity: 1526
Merit: 1001
|
|
August 04, 2012, 09:05:26 PM |
|
128 MB won't be enough to get good performance out of it, will it? And a working bitstream for mining would be needed as well, right? Maybe I'll ask again next year.
|
|
|
|
2112
Legendary
Offline
Activity: 2128
Merit: 1073
|
|
August 04, 2012, 09:14:54 PM Last edit: August 05, 2012, 02:01:15 PM by 2112 |
|
128 MB won't be enough to get good performance out of it, will it? And a working bitstream for mining would be needed as well, right? Maybe I'll ask again next year.
One thread of scrypt(1024,1,1) requires exactly 131583 bytes of memory, which is about 128.5 kB. Thus 128 MB would allow pipelining of over 1000 parallel scrypt() threads.
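The arithmetic can be checked with a short sketch. The 131583-byte figure is taken from the post; the helper name `threads_that_fit` and the binary reading of "128 MB" (128 x 1024 x 1024 bytes) are assumptions made for illustration.

```python
SCRATCHPAD_BYTES = 131583  # per-thread figure quoted in the post


def threads_that_fit(total_bytes: int, per_thread_bytes: int = SCRATCHPAD_BYTES) -> int:
    """Number of whole scrypt threads whose scratchpads fit in total_bytes."""
    return total_bytes // per_thread_bytes


ram_bytes = 128 * 1024 * 1024            # assumption: "128 MB" means 128 MiB
print(round(SCRATCHPAD_BYTES / 1024, 1))  # 128.5 (kB per thread)
print(threads_that_fit(ram_bytes))        # 1020 threads
```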
|
|
|
|
pieppiep
|
|
August 05, 2012, 07:07:37 AM |
|
I think I would round it up to 256 kB per thread, so that the low address lines select a location within the 256 kB and the remaining address lines select the thread. That way you don't have to calculate where to read by doing thread number * 128.5 kB. That would still allow 512 parallel scrypt threads with that amount of memory.
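A rough sketch of that addressing scheme, assuming a byte-addressed memory and a hypothetical helper named `address`: with a power-of-two slot size, the thread number and the offset are simply concatenated, so no multiplier is needed.

```python
SLOT_BITS = 18                # 2**18 bytes = 256 kB per thread slot
SLOT_BYTES = 1 << SLOT_BITS


def address(thread: int, offset: int) -> int:
    """Thread number goes on the upper address lines, offset on the lower 18."""
    if not 0 <= offset < SLOT_BYTES:
        raise ValueError("offset must fit in one 256 kB slot")
    return (thread << SLOT_BITS) | offset


slots = (128 * 1024 * 1024) // SLOT_BYTES
print(slots)                  # 512 thread slots in 128 MB
print(hex(address(3, 0x10)))  # 0xc0010
```

The price is that roughly half of each slot stays unused, which is why the thread count drops from about 1020 to 512.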
|
|
|
|
bitfury
|
|
August 06, 2012, 07:45:56 AM |
|
Well, scrypt's scratchpad is a 1024 x 1024-bit matrix. There are two loops, and they cause the major slowdown:
1st loop - fill the scratchpad, for i from 0 to 1023: scratchpad[i][1023..0] <= X[1023..0]; X[511..0] <= xor_salsa(X[511..0], X[1023..512]); X[1023..512] <= xor_salsa(X[1023..512], X[511..0]);
2nd loop - use the scratchpad, for i from 0 to 1023: X[1023..0] <= X[1023..0] xor scratchpad[X[521..512]][1023..0]; X[511..0] <= xor_salsa(X[511..0], X[1023..512]); X[1023..512] <= xor_salsa(X[1023..512], X[511..0]);
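The two loops can be sketched in Python as follows. This is a minimal, unoptimized sketch and not bitfury's design: the function names (`xor_salsa8`, `mix`, `romix`, `scrypt_1024_1_1`) are made up for illustration, the Salsa20/8 core follows the public scrypt specification (RFC 7914), and the PBKDF2 wrapper around the loops is included only so the result can be compared with `hashlib.scrypt`.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF


def rotl(v, n):
    """Rotate a 32-bit value left by n bits."""
    v &= MASK
    return ((v << n) | (v >> (32 - n))) & MASK


def quarter(x, a, b, c, d):
    """One Salsa20 quarter round on the word list x, in place."""
    x[b] ^= rotl(x[a] + x[d], 7)
    x[c] ^= rotl(x[b] + x[a], 9)
    x[d] ^= rotl(x[c] + x[b], 13)
    x[a] ^= rotl(x[d] + x[c], 18)


def xor_salsa8(dst, src):
    """Return Salsa20/8(dst XOR src); both are lists of 16 32-bit words."""
    start = [p ^ q for p, q in zip(dst, src)]
    x = list(start)
    for _ in range(4):  # 4 double rounds = 8 rounds
        quarter(x, 0, 4, 8, 12)    # column rounds
        quarter(x, 5, 9, 13, 1)
        quarter(x, 10, 14, 2, 6)
        quarter(x, 15, 3, 7, 11)
        quarter(x, 0, 1, 2, 3)     # row rounds
        quarter(x, 5, 6, 7, 4)
        quarter(x, 10, 11, 8, 9)
        quarter(x, 15, 12, 13, 14)
    return [(p + q) & MASK for p, q in zip(x, start)]


def mix(X):
    """The two xor_salsa steps applied to the low and high 512-bit halves."""
    lo = xor_salsa8(X[:16], X[16:])
    hi = xor_salsa8(X[16:], lo)
    return lo + hi


def romix(X, n=1024):
    scratchpad = []
    for _ in range(n):           # 1st loop: sequential writes
        scratchpad.append(X)
        X = mix(X)
    for _ in range(n):           # 2nd loop: data-dependent reads
        j = X[16] & (n - 1)      # bits 521..512 when n = 1024
        X = mix([p ^ q for p, q in zip(X, scratchpad[j])])
    return X


def scrypt_1024_1_1(password, salt, dklen=32):
    block = hashlib.pbkdf2_hmac("sha256", password, salt, 1, 128)
    X = romix(list(struct.unpack("<32I", block)))
    return hashlib.pbkdf2_hmac("sha256", password, struct.pack("<32I", *X), 1, dklen)


print(scrypt_1024_1_1(b"password", b"NaCl").hex())
```

The first loop only ever writes row i, while the second loop cannot know which row it needs until the previous step has finished; that dependency is what makes the memory access pattern the bottleneck.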
While xor_salsa can be pipelined perfectly, only 8 scratchpads fit in a Spartan-6 XC6SLX150. If the BRAMs are not used for Bitcoin computations, it is possible to implement LTC mining on the XC6SLX150 at about 50 - 100 kh/s per chip with about 80% of the slices left free. So a single chip can mine both LTC and BTC using different internal resources: BRAMs for LTC and logic for BTC.
What is interesting to note is that scratchpad access can be pipelined perfectly as well, and it is 1024 bits wide. That means an imaginable FPGA would need only 6 wires to transmit the address out (6 bits + clock) and 1024 input wires to receive the scratchpad data.
This means that multiple smaller DRAM chips working in parallel will do the best job, allowing about 500 mega-transfers per second on a low-cost or mid-cost FPGA; that is about 500 gigabits per second, or 60 gigabytes per second. The overall cost of the DRAM will be about 150 EUR, and of an FPGA to handle it about 300 EUR. If it works in a fully pipelined manner, it would give about 500 kh/s of mining performance for Litecoin.
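A back-of-the-envelope check of these figures, as a sketch. The transfer rate and bus width come from the post; the number of scratchpad accesses per hash is an assumption (1024 if only the reads of the second loop are counted, 2048 if the writes of the first loop share the same bus), and `hashes_per_second` is a hypothetical helper.

```python
TRANSFERS_PER_SECOND = 500e6  # "500 mega-transfers" from the post
BUS_WIDTH_BITS = 1024         # one full scratchpad row per transfer
N = 1024                      # scratchpad rows for scrypt(1024,1,1)

bits_per_second = TRANSFERS_PER_SECOND * BUS_WIDTH_BITS
gigabits = bits_per_second / 1e9        # 512.0
gigabytes = bits_per_second / 8 / 1e9   # 64.0


def hashes_per_second(transfers_per_second, accesses_per_hash):
    """Upper bound on hash rate if memory transfers are the only limit."""
    return transfers_per_second / accesses_per_hash


reads_only = hashes_per_second(TRANSFERS_PER_SECOND, N)            # 488281.25
reads_and_writes = hashes_per_second(TRANSFERS_PER_SECOND, 2 * N)  # 244140.625
print(gigabits, gigabytes, reads_only, reads_and_writes)
```

Under the reads-only assumption the bound is about 488 kh/s, close to the 500 kh/s quoted; counting the writes as well would halve it.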
In general, an FPGA would achieve about the same Litecoin performance as a decent GPU board, but power consumption would be radically lower than for, say, SHA256 Bitcoin mining. Power dissipation would be very low. That is the only advantage; the cost to build the solution would be higher.
What is more interesting is that there will be no cheap route to an ASIC for LTC, as most of the chip area would basically be RAM, and there is no significant edge in producing RAM for pipelining on, say, a 250-nm or 90-nm process. But building cheap 250-nm chips to do the computations and drive DRAM arrays would give a significant cost reduction compared to installing FPGAs. Still, DRAM prices will not go anywhere, and the best DRAM-based solution would not outperform GPUs or CPUs by orders of magnitude.
For scrypt it is best to pipeline about 32-36 calculations deep, not 1024. That would make the xor_salsa calculation and DRAM access performance comparable.
The best on-die solution should contain a DRAM block 1024 bits wide (the bus) and 32768 bits tall; that will be the biggest thing on the chip. Take 90 nm as an example: 90 nm is the smallest feature size, while a single-transistor holding cell with its routing area would be about 0.5 um^2. So the overall chip area would be ~16 mm^2 (!) without self-healing features, and the computation unit would be below 0.2 mm^2 :-)
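The area estimate follows directly from the block dimensions and the cell size given in the post; a small sketch (variable names are illustrative only):

```python
ROW_BITS = 1024       # bus width of the DRAM block
ROWS = 32768          # height of the DRAM block
CELL_AREA_UM2 = 0.5   # one holding cell plus routing, at 90 nm, per the post

cells = ROW_BITS * ROWS                  # 33554432 cells (32 Mbit)
area_mm2 = cells * CELL_AREA_UM2 / 1e6   # 1 mm^2 = 1e6 um^2
scratchpads = ROWS // 1024               # 1024 rows per scratchpad

print(cells, round(area_mm2, 2), scratchpads)  # 33554432 16.78 32
```

The 32 scratchpads that fit in such a block match the lower end of the 32-36 deep pipeline mentioned in the post.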
For 90 nm that still requires a $500k initial investment to build the masks, plus investment in design, etc. You would get a price of about $1 per die for a chip that could compute 124 kh/s. Power consumption would be negligible: about 0.1 - 0.5 W.
What is more interesting is that 180 nm would require an initial investment of ~$150k-$200k and would lead to a $4 / die price (the die will be bigger) and about 60 kh/s performance.
And 250 nm would require _much_ less, an initial investment of about $50k-$80k, and would lead to an $8 / die price (the die will be very big! 128 mm^2!) with about 40 kh/s performance.
So I would consider 180 nm to 250 nm for an LTC ASIC. A big die may not be that bad, as such a die can easily be mounted without packaging (11 mm x 11 mm is really big!).
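To compare the three options, the mask cost can be spread over a production run. The sketch below uses only the figures from the post; the run size of 10,000 dies and the helper `cost_per_khs` are assumptions for illustration, and design, testing, mounting and power costs are ignored.

```python
PROCESSES = {
    # name: (mask cost low USD, mask cost high USD, USD per die, kh/s per die)
    "90nm": (500_000, 500_000, 1.0, 124),
    "180nm": (150_000, 200_000, 4.0, 60),
    "250nm": (50_000, 80_000, 8.0, 40),
}


def cost_per_khs(mask_cost, die_price, khs_per_die, dies):
    """Total cost (masks + dies) divided by total hash rate, in USD per kh/s."""
    return (mask_cost + die_price * dies) / (khs_per_die * dies)


DIES = 10_000  # hypothetical production run
for name, (low, high, price, khs) in PROCESSES.items():
    best = cost_per_khs(low, price, khs, DIES)
    worst = cost_per_khs(high, price, khs, DIES)
    print(f"{name}: {best:.3f} - {worst:.3f} USD per kh/s")
```

At small volumes the mask cost dominates, which favours the older processes; as the run grows, the per-die price takes over and the comparison shifts toward 90 nm.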
Well, these numbers are very preliminary. I am currently learning ASIC design, and I think I would design a scrypt hasher chip as well: 250 nm requires a really small amount of money to start with, and maybe something could be invented for a speedup too. For instance, scrypt accesses memory randomly only in the second part, when it uses the scratchpad; when generating the scratchpad it accesses memory sequentially. Hmm ... maybe a Litecoin chip would even be done before a Bitcoin chip, as it seems to be much simpler and uses well-known techniques - less competition :-)
|
|
|
|
wizzardTim
Legendary
Offline
Activity: 1708
Merit: 1000
Reality is stranger than fiction
|
|
March 19, 2013, 09:40:07 PM |
|
Any news on this? Have you designed the chip?
|
Behold the Tangle Mysteries! Dare to know It's truth.
- Excerpt from the IOTA Sacred Texts Vol. I
|
|
|
loshia
Legendary
Offline
Activity: 1610
Merit: 1000
|
|
March 20, 2013, 08:27:24 AM |
|
I was wondering whether it is possible for miner software + a bitstream to use an external RAM resource, something like PC RAM. We could add as much as we want. Having it as a commission bitstream is OK for me. There are a lot of Spartans out there, and if this is possible at all, whoever makes it will be rewarded for sure. Any comments?
|
|
|
|
2112
Legendary
Offline
Activity: 2128
Merit: 1073
|
|
March 20, 2013, 12:08:19 PM |
|
I was wondering whether it is possible for miner software + a bitstream to use an external RAM resource, something like PC RAM. We could add as much as we want.
Spartan-6 memory controller blocks are designed to control single memory chips, not multi-chip memory modules. There are 4 memory controller blocks in each Spartan-6, but depending on the package not all of them are connected to pins. While it isn't impossible to build a memory-module controller from the regular Spartan-6 logic blocks, such a controller will be inefficient and slow. In the Xilinx product line only Virtex FPGAs can directly interface with multi-chip memory modules. BTW, bitfury is very busy with his 55nm Bitcoin ASIC project: https://bitcointalk.org/index.php?topic=140366.msg1641318#msg1641318
|
|
|
|
loshia
Legendary
Offline
Activity: 1610
Merit: 1000
|
|
March 20, 2013, 12:12:06 PM |
|
I was wondering whether it is possible for miner software + a bitstream to use an external RAM resource, something like PC RAM. We could add as much as we want.
Spartan-6 memory controller blocks are designed to control single memory chips, not multi-chip memory modules. There are 4 memory controller blocks in each Spartan-6, but depending on the package not all of them are connected to pins. While it isn't impossible to build a memory-module controller from the regular Spartan-6 logic blocks, such a controller will be inefficient and slow. In the Xilinx product line only Virtex FPGAs can directly interface with multi-chip memory modules. BTW, bitfury is very busy with his 55nm Bitcoin ASIC project: https://bitcointalk.org/index.php?topic=140366.msg1641318#msg1641318
10X. By the way, I have been watching bitfury closely for a long time :)
|
|
|
|
pieppiep
|
|
March 20, 2013, 12:25:34 PM |
|
Remember, litecoin doesn't use the memory bandwidth, it uses the L1 (L2?) cache bandwidth, which is much higher.
|
|
|
|
daybyter
Legendary
Offline
Activity: 965
Merit: 1000
|
|
March 20, 2013, 01:05:42 PM |
|
|
|
|
|
wizzardTim
Legendary
Offline
Activity: 1708
Merit: 1000
Reality is stranger than fiction
|
|
March 20, 2013, 01:55:44 PM |
|
What exactly can we do with this board? It says it has advanced memory interfacing. Can we use it for mining LTCs?
|
|
|
|
daybyter
Legendary
Offline
Activity: 965
Merit: 1000
|
|
March 20, 2013, 03:14:51 PM |
|
You can add up to 4 GB of RAM. I thought that might be sufficient for an LTC lookup table.
|
|
|
|
tacotime
Legendary
Offline
Activity: 1484
Merit: 1005
|
|
March 20, 2013, 04:13:35 PM |
|
Any news on this? Have you designed the chip?
These are theoretical numbers... laSeek has been busting his ass to try to get kilohash/second rates into the double digits with inexpensive FPGAs. The trials on Altera FPGAs were a trainwreck. The problem is that even with a large number of slices, you will run into two issues: 1) Memory bandwidth in FPGA devices is poor compared to a GPU. For on-slice cache it is 10-20x less than that of a GPU, and for off-chip memory it is about 20-40x less than a GPU. 2) The clock rate of FPGA devices is in general lower than that of GPUs. You can resolve 1) by chaining memory interfaces in a multichip configuration, but that's a lot of hardware customization.
|
XMR: 44GBHzv6ZyQdJkjqZje6KLZ3xSyN1hBSFAnLP6EAqJtCRVzMzZmeXTC2AHKDS9aEDTRKmo6a6o9r9j86pYfhCWDkKjbtcns
|
|
|
wizzardTim
Legendary
Offline
Activity: 1708
Merit: 1000
Reality is stranger than fiction
|
|
March 20, 2013, 04:19:02 PM |
|
You can add up to 4 GB of RAM. I thought that might be sufficient for an LTC lookup table.
That's good news. I do not think the RAM will be expensive, so whom should we ask to get better info? Do you have any insights on this: how many hashes it will produce, and what is needed to program the board so that it can mine with scrypt? I'm a software engineer, but I've never programmed a board.
|
|
|
|
daybyter
Legendary
Offline
Activity: 965
Merit: 1000
|
|
March 20, 2013, 05:39:18 PM |
|
That's exactly my problem. I write software, and I programmed PALs etc. many years ago, but I have never programmed an FPGA. I got a link to a SHA256 implementation in VHDL (I know, that's not what LTC requires), and I compared it to the C sources just to get an idea of how similar they look. At first glance you can port the C sources almost 1:1. But I guess the devil is in the details, so I won't claim that a scrypt port is no problem. I wondered whether it's feasible to simulate the whole hardware before any money is spent on prototype boards. But maybe the dev software alone will cost quite some money... don't know...
|
|
|
|
tacotime
Legendary
Offline
Activity: 1484
Merit: 1005
|
|
March 20, 2013, 06:01:38 PM |
|
LaSeek has been running simulations like crazy; they come out fast, but after synthesis the designs have run very slowly so far.
|
|
|
|
wizzardTim
Legendary
Offline
Activity: 1708
Merit: 1000
Reality is stranger than fiction
|
|
March 21, 2013, 09:05:11 AM |
|
LaSeek has been running simulations like crazy; they come out fast, but after synthesis the designs have run very slowly so far.
Is there any chance that the slow speed comes from the simulation itself? What if we tried it on a real board? Would the results be similar or very different (better)?
|
|
|
|
tacotime
Legendary
Offline
Activity: 1484
Merit: 1005
|
|
March 21, 2013, 07:47:55 PM |
|
The slow speeds are on real chips. The simulations are what run fast.
They're working on a lot of optimizations for the N=1024, p=1, r=1 scenario that is the current implementation. I think it's more of a technical challenge for laSeek as an FPGA engineer than anything else. It'll be interesting to see if he gets it off the ground.
|
|
|
|
|