HardwareCollector
Member
Offline
Activity: 144
Merit: 10
|
|
May 05, 2018, 07:11:58 PM |
|
@GPUHoarder
What has been your experience with the latest HLS tools from Xilinx? A few years ago, the performance of their sample Bitcoin miner was atrocious. A $2K board with the throughput of 80 MH/s, not sure why they even included that sample.
We are low-level RTL here, so I haven’t messed with much of the HLS side. The big issue with FPGAs is that people used to buying graphics cards, which are all commoditized close to a low MSRP, don’t understand that the FPGA market is nothing like this. The traditional market for these things was avionics, defense, government, nation states, etc. The price on Digi-Key is nowhere near the actual price for someone building their own hardware, and the dev kits are expensive because they come with every bell and whistle you might want to use with a particular chip and they are very low volume. This is changing a bit, but these VCU1525 dev accelerators are no different from Nvidia’s P100/V100 PCIe cards or AMD’s old FirePro (now Radeon Pro) multi-thousand-dollar cards. They don’t cost 10x more to make; the target market is just far less cost sensitive. The gap between ASIC and FPGA is also much narrower now than it was a few years ago. If a market emerges to consume 100,000 FPGA accelerators for mining, someone building their own hardware with a good relationship with Xilinx or Intel can and will drive the cost down at least an order of magnitude.
Thanks for the insight, much appreciated.
|
|
|
|
gameboy366
Jr. Member
Offline
Activity: 252
Merit: 8
|
|
May 05, 2018, 07:29:09 PM |
|
Can workstation cards become useful for mining? Damn! They're going to teach us FPGA next year in college.
|
|
|
|
GPUHoarder
Member
Offline
Activity: 154
Merit: 37
|
|
May 05, 2018, 07:39:39 PM |
|
Can workstation cards become useful for mining ? Damn ! They're gonna teach us FPGA next year in college.
You mean the FirePro type GPUs? Definitely. I have many, many older FirePro cards that had plenty of memory bandwidth but underpowered (or overpowered, depending on how you look at it) cores compared to the current gen. These cards got large page table support and some other niceties because of who the customers were. They also support an FPGA on the PCIe bus directly writing into / reading from their GPU memory (albeit at PCIe speeds). This works great with something like ETH, because you can do all the compute-heavy Keccak off the GPU but still use all the memory bandwidth.
|
|
|
|
gameboy366
Jr. Member
Offline
Activity: 252
Merit: 8
|
|
May 05, 2018, 08:13:51 PM |
|
Then why don't I see anyone using them for mining? They are widely available compared to FPGAs. How do they compare to current-gen GPUs? I have some old Quadro cards at home gathering dust; I wish they would become useful.
|
|
|
|
philipma1957
Legendary
Offline
Activity: 4298
Merit: 8768
'The right to privacy matters'
|
|
May 05, 2018, 09:12:20 PM |
|
Let’s play the devil’s advocate.
While all of this sounds good, there are some risks that you can’t afford to ignore:
Each piece of hardware that is brought online decreases profitability for all other existing hardware (based on hash rate). Everyone is fighting for a piece of the same pie.
The early adopter has the shortest time to ROI before the coin becomes saturated with hash rate. Things always look good if you ignore the eventual increase in global hash rate.
You are at the developer’s mercy for releasing FPGA bitstreams if you can’t design your own. Performance can vary widely between public and private designs. You are locked to a few small coins, and someone with a better algo implementation can decimate your ROI time.
And the biggest risk of them all is AWS F1 instances. As long as it’s profitable to mine on AWS or other cloud providers, people will do so. Amazon does not shut down your instances when you are upfront with them, and how do I know this? They care about your ability to pay bills and follow their ToS.
Just playing the devil's advocate.
This is a major concern. Just how long before my 5 x 5000 = 25000 worth of FPGAs gets the software upgrades I need? There was some safety in a world of 200,000 GPUs on a coin, since many wanted software and a developer, Claymore for instance, could simply take 2% off the top. In this world of FPGAs there will not be 200,000 FPGAs screaming "please give me new software upgrades". And as I said, if the coin changes algos a lot, even the FPGA guy will struggle to keep up with the switches. And if the coin never changes, the ASIC builder will whale. So the FPGA guy will need just enough algo changes to be okay. I suspect the FPGA will be good for some coins, the ASIC will be good for some coins, and the GPU will still do okay.
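The dilution argument in the quoted risks (every new device shrinks everyone's share, so early adopters reach ROI fastest) can be sketched in a few lines. This is only an illustration: the function name, the constant daily growth rate of the network hash rate, and every number in the usage note are assumptions, not figures from this thread.

```python
def days_to_break_even(hw_cost, my_hashrate, net_hashrate,
                       daily_network_reward, daily_growth,
                       daily_power_cost=0.0, max_days=3650):
    """Days until cumulative earnings cover the hardware cost.

    Assumes (hypothetically) that the miner earns a share of a fixed daily
    network reward proportional to its share of total hash rate, and that
    the rest of the network grows by `daily_growth` per day.
    Returns None if break-even is not reached within `max_days`.
    """
    earned = 0.0
    for day in range(1, max_days + 1):
        share = my_hashrate / (net_hashrate + my_hashrate)
        earned += share * daily_network_reward - daily_power_cost
        if earned >= hw_cost:
            return day
        # Everyone else keeps adding hardware, shrinking our share.
        net_hashrate *= 1 + daily_growth
    return None


# Same device, same coin: static network versus 1% daily growth.
static = days_to_break_even(1000, 1, 99, 1000, 0.0)
growing = days_to_break_even(1000, 1, 99, 1000, 0.01)
print(static, growing)
```

With a static network the break-even day is simply cost divided by daily earnings; with any sustained growth it moves out, and past some growth rate it is never reached at all, which is the "things always look good if you ignore the eventual increase in global hash rate" point.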
|
|
|
|
s1gs3gv
Legendary
Offline
Activity: 1316
Merit: 1014
ex uno plures
|
|
May 05, 2018, 10:48:28 PM |
|
Folks, all these predictions and calculations are fun to read, but at the end of the day the 3rd law of crypto-dynamics states that the value of a token asymptotically approaches the cost of mining it, plus a small profit related to the cost of capital. Ultimately mining economics become like a utility. I predict that in the not too distant future governments will secure key blockchains with public money as a public service.
And no form of private mining will be profitable: CPU, GPU, FPGA or ASIC.
|
|
|
|
suchmoon
Legendary
Offline
Activity: 3836
Merit: 9059
https://bpip.org
|
|
May 05, 2018, 11:18:15 PM |
|
Folks, all these predictions and calculations are fun to read, but at the end of the day the 3rd law of crypto-dynamics states that the value of a token asymptotically approaches the cost of mining it, plus a small profit related to the cost of capital. Ultimately mining economics become like a utility. I predict that in the not too distant future governments will secure key blockchains with public money as a public service.
And no form of private mining will be profitable: CPU, GPU, FPGA or ASIC.
And just to make things run smoother the governments may allow certain private entities (let's call them Blockchain Authorized Node Clusters) to create any quantities of coins/tokens as needed...
|
|
|
|
senseless
|
|
May 05, 2018, 11:18:58 PM |
|
at least an order of magnitude.
As someone who has sat down with a regional vice president of sales for Xilinx: don't hold your breath. You'd be surprised at what Amazon, Tencent, etc. paid for their XCVU9Ps. Best case, you're looking at a reduction of about 65% if the order volume is greater than a $10M lump sum.
|
|
|
|
senseless
|
|
May 05, 2018, 11:25:28 PM |
|
I won’t say it’s impossible, but I would be really genuinely surprised. 32MB / hash of total bandwidth (read + write) is needed, and 2MB or so of stashes per hashcore.
You have 1280 URAM blocks of 288Kb, each with a 72-bit dual-ported interface, in the biggest configuration. That’s an incredible amount of internal bandwidth, but you can only store 23 or so simultaneous Cryptonight7 2MB blocks in that. The absolute biggest part (which isn’t on the 1525 board) has 360Mbit URAM, 96Mbit BRAM, and 48Mbit distributed RAM, holding a theoretical 63 MB of pipelines, assuming you didn’t need a single bit of that for the rest of your logic (you do).
The external memory at, say, 4x 64-bit DIMMs @2666 is only 85GB/s, or 2.6 kH/s worth of bandwidth with a perfect access pattern.
Even if you could imaginarily use all 2000+ balls on the FPGA for 2666 MT/s DDR-style speeds you’d still only clear 20 kH/s against external memory, and that isn’t even real bandwidth.
Even if you took the biggest part with 128x 32 Gbps transceivers to SERDES memory you’d only have a 16 kH/s limit from bandwidth.
Unless you break the algorithm itself, there’s nowhere to find the bandwidth + storage space for 64 kH/s on a single FPGA.
You're missing a really big part of the UltraRAM, one of the most attractive things it has to offer: true dual-port, single-clock read/write. Also, when you chain UltraRAMs together it increases the bus width proportionally to the amount it increases the latency. I never completed Monero, but my estimates were in the 4-8 kH/s per board range at 100W.
The one I'm having a hard time believing is Keccak. The VCU1525 only has 160A VCCINT. I was hitting 2.5 GH/s at 140-150A VCCINT with 12 cores operating at 225 MHz. This wasn't optimized, like at all, but I'm having a hard time finding 7x worth of hashrate with optimization. Then again, I'm only using Vivado and didn't spend a great deal of time on optimization.
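For readers following the numbers, the capacity and bandwidth ceilings in the quoted CryptoNight estimate can be reproduced with back-of-envelope arithmetic. The sketch below assumes decimal megabytes, reads "4x64 DIMMs" as four 64-bit channels, and treats each ball or transceiver as carrying one bit per transfer; those readings and all variable names are assumptions made for illustration, not statements from either poster.

```python
SCRATCHPAD_BYTES = 2 * 10**6    # ~2 MB of scratchpad per hash core (from the quote)
TRAFFIC_PER_HASH = 32 * 10**6   # ~32 MB of read + write traffic per hash (from the quote)


def hashes_per_second(bytes_per_second):
    """Upper bound on hash rate if memory bandwidth were the only limit."""
    return bytes_per_second / TRAFFIC_PER_HASH


# On-chip storage: 1280 URAM blocks of 288 Kb each.
uram_mbit = 1280 * 288 / 1024                          # 360 Mbit
scratchpads_in_uram = uram_mbit * 10**6 / 8 / SCRATCHPAD_BYTES

# URAM + BRAM + distributed RAM on the largest part, in MB.
total_on_chip_mb = (360 + 96 + 48) / 8

# External bandwidth ceilings, in bytes per second.
ddr_bw = 4 * 8 * 2666e6         # four 64-bit channels at 2666 MT/s
ball_bw = 2000 * 2666e6 / 8     # every one of ~2000 balls as a 2666 MT/s data pin
serdes_bw = 128 * 32e9 / 8      # 128 transceivers at 32 Gbps

print(f"scratchpads that fit in URAM: {scratchpads_in_uram:.1f}")
print(f"total on-chip memory:         {total_on_chip_mb:.0f} MB")
print(f"DDR4 ceiling:                 {hashes_per_second(ddr_bw):.0f} H/s")
print(f"all-balls ceiling:            {hashes_per_second(ball_bw):.0f} H/s")
print(f"SERDES ceiling:               {hashes_per_second(serdes_bw):.0f} H/s")
```

Under these assumptions the results land close to the quoted figures (roughly 23 scratchpads, 63 MB, 2.6 kH/s, 20 kH/s and 16 kH/s). The disagreement in the thread is therefore less about the arithmetic than about how much of the on-chip memory a real design can actually dedicate to scratchpads.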
|
|
|
|
GPUHoarder
Member
Offline
Activity: 154
Merit: 37
|
|
May 05, 2018, 11:28:03 PM |
|
I wasn’t missing it - I even mentioned it. I know my FPGA prices are more than an order of magnitude below the “digikey” prices.
The true dual port is truly the reason it works as well as it does - write completion before read in the dependency chaining. The issue (at least with cryptonight) isn’t the bandwidth at all, it is the amount available. UltraRAM is great in general.
On Keccak, remember that most crypto doesn’t need anywhere near the full 1600 bits of initial or final state. Trace through those dependency chains over 24 rounds and you eliminate a lot of calculation.
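The Keccak remark can be illustrated on the final round alone. The sketch below is not GPUHoarder's design; it assumes, hypothetically, that a miner only compares the first 64-bit output lane against a target, and shows that in round 24 only 3 of the 25 lanes then need the rho/pi/chi steps (theta still needs all five column parities). The function names are made up for this example, and the small SHA3-256 helper is there only to sanity-check the permutation against `hashlib`.

```python
import hashlib

MASK = (1 << 64) - 1
# Rotation offsets ROT[x][y] for the rho step of Keccak-f[1600].
ROT = [[0, 36, 3, 41, 18], [1, 44, 10, 45, 2], [62, 6, 43, 15, 61],
       [28, 55, 25, 21, 56], [27, 20, 39, 8, 14]]


def rol(v, n):
    return ((v << n) | (v >> (64 - n))) & MASK


def round_constants():
    # Iota constants generated with the standard 8-bit LFSR.
    rcs, r = [], 1
    for _ in range(24):
        rc = 0
        for j in range(7):
            r = ((r << 1) ^ ((r >> 7) * 0x71)) % 256
            if r & 2:
                rc ^= 1 << ((1 << j) - 1)
        rcs.append(rc)
    return rcs


RC = round_constants()


def column_mix(a):
    # Theta helper: per-column parities and the value XORed into column x.
    c = [a[x][0] ^ a[x][1] ^ a[x][2] ^ a[x][3] ^ a[x][4] for x in range(5)]
    return [c[(x - 1) % 5] ^ rol(c[(x + 1) % 5], 1) for x in range(5)]


def full_round(a, rc):
    d = column_mix(a)
    b = [[0] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            # theta, then rho (rotate) and pi (move lane to its new position)
            b[y][(2 * x + 3 * y) % 5] = rol(a[x][y] ^ d[x], ROT[x][y])
    # chi, then iota on lane (0, 0)
    out = [[b[x][y] ^ (~b[(x + 1) % 5][y] & b[(x + 2) % 5][y])
            for y in range(5)] for x in range(5)]
    out[0][0] ^= rc
    return out


def last_round_lane0(a, rc):
    # Output lane (0, 0) depends on only three rho/pi outputs.
    d = column_mix(a)
    b0 = a[0][0] ^ d[0]
    b1 = rol(a[1][1] ^ d[1], 44)
    b2 = rol(a[2][2] ^ d[2], 43)
    return b0 ^ (~b1 & b2) ^ rc


def keccak_f(a):
    for rc in RC:
        a = full_round(a, rc)
    return a


def keccak_f_lane0(a):
    for rc in RC[:-1]:
        a = full_round(a, rc)
    return last_round_lane0(a, RC[-1])


def sha3_256_short(msg):
    # Single-block SHA3-256, used only to validate keccak_f.
    assert len(msg) < 136
    block = bytearray(200)
    block[:len(msg)] = msg
    block[len(msg)] ^= 0x06
    block[135] ^= 0x80
    a = [[int.from_bytes(block[8 * (x + 5 * y):8 * (x + 5 * y) + 8], "little")
          for y in range(5)] for x in range(5)]
    a = keccak_f(a)
    return b"".join(a[x][0].to_bytes(8, "little") for x in range(4))
```

The same back-tracing can be pushed further into earlier rounds and applied to constant parts of the input state, which is presumably where the larger savings described in the post would come from; this sketch stops at the last round to stay small and checkable.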
|
|
|
|
senseless
|
|
May 05, 2018, 11:30:54 PM |
|
There's also a bunch of block RAM and distributed RAM. AES itself is tiny (<30K LUTs), and the secondary hashes don't need to be completed on the FPGA (meaning, you don't really need to put Groestl, JH, etc. on the FPGA; you can just read back the 8 kH/s of results and complete the secondary hashes on the CPU). While I have those completed (the secondaries), I had never intended on putting them on the FPGA for cryptonight.
|
|
|
|
philipma1957
Legendary
Offline
Activity: 4298
Merit: 8768
'The right to privacy matters'
|
|
May 05, 2018, 11:32:31 PM |
|
Folks, all these predictions and calculations are fun to read, but at the end of the day the 3rd law of crypto-dynamics states that the value of a token asymptotically approaches the cost of mining it, plus a small profit related to the cost of capital. Ultimately mining economics become like a utility. I predict that in the not too distant future governments will secure key blockchains with public money as a public service.
And no form of private mining will be profitable: CPU, GPU, FPGA or ASIC.
Yep I could see this happening. I figure a few dozen coins are enough if it goes fully under government controls.
|
|
|
|
GPUHoarder
Member
Offline
Activity: 154
Merit: 37
|
|
May 05, 2018, 11:34:43 PM |
|
I’m not sure you actually read my posts on the topic, as you’re repeating a few things I already stated - such as that I don’t do secondary hashes on the FPGA. 30k LUTs for AES? That’s a huge amount more than my cores...
|
|
|
|
senseless
|
|
May 05, 2018, 11:38:09 PM |
|
I was going through all the threads and scanning. I'll admit I didn't read everything. And ya, I'm a little high on my AES size estimate. How many are you currently operating, or are you mainly leasing?
It's kind of funny, I would lurk the AWS forums and see all the questions from people who were obviously mining. No one would admit it, or talk about it, for fear of revealing the secret. Now it's out and it can be talked about.
Btw, to the OP, you're wrong about AWS caring that people mine on their servers. You're not allowed to use shared resources for mining. Dedicated resources you can do whatever you want with. I've been mining on AWS and in regular contact with the AWS engineers running the F1 project since June 2017. I was also in the beta program.
|
|
|
|
GPUHoarder
Member
Offline
Activity: 154
Merit: 37
|
|
May 05, 2018, 11:40:52 PM |
|
We operate custom hardware. We’ve done the 100 device first spin, and are currently working on a full batch order > 1000 FPGAs for the Ultrascale+ end.
I haven’t posted here much, so I seem to have a lot of message reply and rate limits.
Regarding AWS - once the cat is out of the bag it won’t be profitable.
|
|
|
|
senseless
|
|
May 05, 2018, 11:42:34 PM |
|
We operate custom hardware. We’ve done the 100 device first spin, and are currently working on a full batch order > 1000 FPGAs for the Ultrascale+ end.
I haven’t posted here much, so I seem to have a lot of message reply and rate limits.
I'll send a PM with my Skype details; I would like to talk further in private.
|
|
|
|
senseless
|
|
May 05, 2018, 11:50:54 PM |
|
Regarding AWS - once the cat is out of the bag it won’t be profitable.
Ya, I'm hoping that happens sooner rather than later. Let's get it out there and burn that bridge. You don't really want anyone to follow in your footsteps. Competition is bad for our bottom line. Let them fight over the pennies and scraps like they do with the GPU markets. :p
|
|
|
|
mrb
Legendary
Offline
Activity: 1512
Merit: 1028
|
|
May 06, 2018, 03:09:01 AM |
|
whitefire990: really nice job in identifying & executing on this opportunity with FPGAs! I still have a bunch of Spartan6 LX150 FPGA hardware (from 2012, when FPGA-mining Bitcoin was a thing!) but I doubt this type of FPGA can be as profitable as your cards. Do you have a particular process for identifying which PoW algos are the most profitable on FPGAs? For example you implemented Phi1612... it is not even implemented by tpruvot's cpuminer. It's used by LUXCoin which is a rather obscure coin (#458 on https://coinmarketcap.com). Did you manually examine hundreds of PoW algorithms to find the best opportunities?
|
|
|
|
DigitalCruncher
Jr. Member
Offline
Activity: 59
Merit: 1
|
|
May 06, 2018, 03:29:10 AM |
|
Regarding AWS - once the cat is out of the bag it won’t be profitable.
Ya, I'm hoping that happens sooner rather than later. Let's get it out there and burn that bridge. You don't really want anyone to follow in your footsteps. Competition is bad for our bottom line. Let them fight over the pennies and scraps like they do with the GPU markets. :p
What is the primary use of Amazon AWS FPGA nodes? I mean, in real life? I have an idea to find such an application and build an FPGA-based blockchain around it. Could it be bioinformatics?
|
|
|
|
jimmykl
|
|
May 06, 2018, 05:23:42 AM |
|
What is the primary use of Amazon AWS FPGA nodes? I mean, in real life? I have an idea to find such an application and build an FPGA-based blockchain around it.
Could it be bioinformatics?
https://aws.amazon.com/ec2/instance-types/f1/
- Genomics Research
- Financial Analytics
- Real Time Video Processing
- Big Data Search and Analytics
- Security
|
|
|
|
|