Bitcoin Forum
Author Topic: Acorn M.2 FPGA based GPU Accelerator  (Read 20917 times)
GPUHoarder
Member
**
Offline Offline

Activity: 112
Merit: 31


View Profile
June 01, 2018, 06:06:38 AM
Merited by suchmoon (5), 64dimensions (5)
 #1

This information all existed in the discord but I wanted to share it with everyone.

So we’ve developed an FPGA accelerator over the past few months in the M.2 form factor (same as NVMe drives), designed to operate both standalone and in conjunction with GPUs.

The first version to be released has 4x high-speed PCIe lanes to communicate with the system/GPUs, as well as 512MB or 1GB of onboard DDR3, along with a 100k+ LE or 200k+ LE FPGA of a high speed grade. We’ve named it the Acorn, and the three models are the CLE-101, CLE-215, and CLE-215+.

General expectation is it will provide performance roughly scaled with price/performance of the VCU1525, but it has a unique role and is not applicable to all of the same algorithms. Its performance in this role is dominated by its interconnect bandwidth and not its processing power.

It is capable of providing up to 30MH of lift to a mining system with GPUs on a handful of algorithms, or of operating independently at higher-than-GPU hashrates on other non-memory-intensive algorithms (Keccak, etc). I will be releasing it alongside our mining software and bitstreams to support hybrid GPU acceleration. This project was not developed commercially; it grew out of a product for my day job, for internal use in our own mining systems, to give an edge to traditional PCs and gaming systems turned miners.

The accelerator works by streaming high-bandwidth hash state between GPUs and the FPGA over PCIe, allowing each piece of hardware to handle the portion of the algorithm it is best at. In general this means memory-bandwidth-heavy or area-heavy portions of the algorithm may be handled by the GPU, and hash algorithms designed for hardware implementations are handled by the FPGA. This approach works for any algorithm whose internal state is 256 bit (60Mh gains) or 512 bit (x16r, Lyra2Rev2, etc.) or smaller. The accelerator supports rapidly reconfiguring its algorithms from on-board DDR to enable handling of per-block or periodic (TimeTravel10) re-sequencing. It was designed originally to provide performance gains (especially for older GPUs with poor cores) and power savings for ETH by way of offloading the opening and closing Keccak calculations, as well as hash-selection to improve locality of reference for early ETH rounds.
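
A rough way to see why the state size matters: if every hash has to move its state across the link once in each direction, the link bandwidth caps the number of hashes per second. The sketch below is only a back-of-the-envelope estimate, not the author's published method. It assumes nominal per-lane PCIe rates, a full-duplex link, and no packet overhead; the function names are made up for illustration.

```python
# Back-of-the-envelope bound on hash states per second through a PCIe link.
# Assumptions (not from the post): nominal per-lane rates, one state per hash
# in each direction on a full-duplex link, no packet or protocol overhead.

# Nominal usable bytes per second, per lane, per direction.
PCIE_LANE_BYTES_PER_S = {
    1: 250e6,                # Gen1: 2.5 GT/s with 8b/10b encoding
    2: 500e6,                # Gen2: 5 GT/s with 8b/10b encoding
    3: 8e9 * 128 / 130 / 8,  # Gen3: 8 GT/s with 128b/130b encoding
}


def link_bytes_per_s(gen, lanes):
    """Nominal one-direction bandwidth of a PCIe link in bytes per second."""
    return PCIE_LANE_BYTES_PER_S[gen] * lanes


def max_states_per_s(gen, lanes, state_bits):
    """Upper bound on states per second if each hash moves one state each way."""
    return link_bytes_per_s(gen, lanes) / (state_bits / 8)


if __name__ == "__main__":
    for bits in (256, 512):
        m_states = max_states_per_s(gen=2, lanes=4, state_bits=bits) / 1e6
        print(f"PCIe 2.0 x4, {bits}-bit state: ~{m_states:.2f} M states/s")
```

Under these assumptions a PCIe 2.0 x4 link tops out at about 31 M states/s for a 512-bit state and about 62 M states/s for a 256-bit state, which is in the same range as the 30MH and 60Mh figures quoted in this post. Real throughput would be lower once protocol overhead is counted.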

Given the anticipated path of ETH itself regarding POS and other fork possibilities, please consider all those things if ETH is your target. It may be the most popular coin for GPUs, but that does not mean it is the best use of FPGA or hybrid tech.

I’ve decided to make this hardware available to the community at near cost, given all the FPGA interest lately and my belief that broadly available general-purpose acceleration hardware at its true market cost (not low-volume, industry-specific dev boards) is the best defense against complete ASIC centralization. You will see this philosophy reflected in my activity around the VCU1525 board as well.

Anticipated pre-order prices are $199 for the CLE-101 512MB variant and $329 for the high-end, highest-speed-grade CLE-215+ 1GB DRAM version. On-board power consumption is nominally 15W. It will include a heatsink adequate for this dissipation level with reasonable airflow. It is important to note that, to fit the FPGA, this adapter is slightly outside the 2280 M.2 specification, coming in at 2380. The vast majority of M.2 slots should not have an issue with this.
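
For readers unfamiliar with the M.2 size codes: the first two digits are the module width in millimetres and the remaining digits are the length. A small sketch of that decoding follows; `parse_m2_size` is a made-up helper name used only for illustration.

```python
def parse_m2_size(code):
    """Split an M.2 size code such as '2280' into (width_mm, length_mm).

    The first two digits are the width; the remaining two or three digits
    are the length (for example '22110' is 22 mm x 110 mm).
    """
    if not code.isdigit() or len(code) not in (4, 5):
        raise ValueError(f"not an M.2 size code: {code!r}")
    return int(code[:2]), int(code[2:])


if __name__ == "__main__":
    standard = parse_m2_size("2280")
    acorn = parse_m2_size("2380")
    print("standard:", standard, "acorn:", acorn)
    print("extra width (mm):", acorn[0] - standard[0])
```

Read this way, 2380 is the same 80 mm length as a standard 2280 module but 1 mm wider.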

I am also pursuing making available well priced options for individual PCIe x4 to M.2 M-key host boards (these are broadly available for $10-15), as well as Quad-M.2 PCIe switched and Bifurcated x16 host boards for those who do not have the available M.2 M-Key slots or require up to 240MH of acceleration.

I won’t post exact per algorithm stats or performance until I can do final testing of the actual boards to be shipped with the release hardware/heatsink/thermal management pieces in place, at which point I’ll accept pre-orders. This device requires quite a bit of testing to cover the list of common GPUs, PCIe configurations, and supported algorithms. I have no desire to sell anyone anything not useful to them, or to push a board at all, let alone one based on 3D renders, prototype parts pictures, or choppy YouTube videos, so I believe this full set of data along with final product pictures and overview must be published before I will take any preorders. I am sorry if that tests your patience.

Prototypes exist and I’ve already secured most of the hardware for a first batch so lead time will only be PCB + assembly.

At the time of shipping I will be releasing our internal miner software in closed source form for Windows and Linux that supports GPU only as well as Hybrid acceleration. You’re also welcome to develop your own bitstreams for the accelerator, and will have all the specifications necessary to do so.

I will also be publishing the interface for the bitstreams so that open source miners that wish to can use the FPGA directly.

We are handling all CE, FCC, RoHS, and other certifications as well as ITAR and export compliance, so we will be able to ship to all countries not under US embargo. Taxes and import duties will fall on the purchaser. We will be offering at least a 90 day warranty.

All feedback is welcome. This is not my source of income, nor that of the rest of my team, and we don’t want anyone’s money unless they are happy with what we’re offering. I’m also happy to continue conversations I am already having with coin devs and miner developers on how or if FPGAs fit into their plans for their coin and/or ASIC Resistance strategies. This community is about choice, and I will respect the choices of those teams.

So all I would like from all of you, beyond the feedback, is for anyone interested to hit our pre-order registration survey at http://www.squirrelsresearch.com to help us ensure we’re covering your needs and wants and have all the appropriate hardware secured. Based on that info, very detailed performance information and full device photos (spoiler - it looks like an SSD with a heatsink on it!) will be published at the time preorders open, expected in mid-June.

- David


trillobeat
Jr. Member
*
Offline Offline

Activity: 34
Merit: 0


View Profile
June 01, 2018, 06:15:44 AM
 #2

Good news!

I want to ask again: does the M.2 accelerator help only a single GPU on the motherboard, or will all GPUs benefit?

Can motherboards with two M.2 slots use two accelerator units (in the case of a single GPU per accelerator)?
GPUHoarder
Member
**
Offline Offline

Activity: 112
Merit: 31


View Profile
June 01, 2018, 06:16:51 AM
 #3

One Acorn can help multiple GPUs depending on algorithm, you can use two in two M.2 slots.
trillobeat
Jr. Member
*
Offline Offline

Activity: 34
Merit: 0


View Profile
June 01, 2018, 06:45:14 AM
 #4

So an ASRock B250M Pro4 mATX board, which has 2 PCIe 3.0 x16 slots and 2 Ultra M.2 PCIe Gen 3.0 x4 slots, will get two GPUs fully boosted by two accelerators.

In the case of a board with a single x16 slot and, say, 4 PCIe x1 slots, all GPUs will benefit, but to a lesser degree because of the limitation of the x1 slots?

Is this correct?
GPUHoarder
Member
**
Offline Offline

Activity: 112
Merit: 31


View Profile
June 01, 2018, 07:14:20 AM
 #5

That’s correct.
KaydenC
Sr. Member
****
Offline Offline

Activity: 546
Merit: 256


Developer of EscrowMyEther dApp


View Profile WWW
June 01, 2018, 07:30:14 AM
 #6

How important is pcie 3.0 4x bandwidth for this FPGA? Because I use 8 gpu riserless Onda mobos; the only way it'll connect is via m2 key host boards on Pcie 1x lanes.


53 Eth Escrowed, I offer escrow service: https://bitcointalk.org/index.php?topic=2221107.0
Buying >$250 Bitmain coupons
GPUHoarder
Member
**
Offline Offline

Activity: 112
Merit: 31


View Profile
June 01, 2018, 08:14:01 AM
 #7

Unfortunately if you don’t have any place with at least 4x PCIe 2.0 it will be difficult to use in the GPU hybrid role. You would be limited to standalone algorithms.
josywong
Newbie
*
Offline Offline

Activity: 29
Merit: 0


View Profile
June 01, 2018, 09:21:27 AM
 #8

reserve post. reserve a unit. non-us.
melpheos
Jr. Member
*
Offline Offline

Activity: 322
Merit: 3


View Profile
June 01, 2018, 09:35:33 AM
 #9

Before testing, do you have a very, very rough estimate of the acceleration for a few algorithms?
Otherwise I will wait for the tests, obviously :)
Dotem
Newbie
*
Offline Offline

Activity: 8
Merit: 0


View Profile
June 01, 2018, 10:19:43 AM
 #10

Very interesting.  I look forward to seeing the data.  Watching.
FFI2013
Hero Member
*****
Offline Offline

Activity: 738
Merit: 500


View Profile
June 01, 2018, 11:28:39 AM
 #11

How important is pcie 3.0 4x bandwidth for this FPGA? Because I use 8 gpu riserless Onda mobos; the only way it'll connect is via m2 key host boards on Pcie 1x lanes.


I was looking at those mobos and wasn't sure about the size of the gap between GPUs. Is it sufficient for cooling, or do you use a fan to help?
gameboy366
Jr. Member
*
Offline Offline

Activity: 182
Merit: 8


View Profile
June 01, 2018, 11:47:23 AM
 #12

How important is pcie 3.0 4x bandwidth for this FPGA? Because I use 8 gpu riserless Onda mobos; the only way it'll connect is via m2 key host boards on Pcie 1x lanes.


I was looking at those mobos and wasn't sure about the size of the gap between GPUs. Is it sufficient for cooling, or do you use a fan to help?

Those Onda mobos were made to be used inside a server-chassis-style case, which has many intake and exhaust fans.

-Ravencoin (RVN)
-ZCoin (XZC)
-EOS Classic (EOSC)
kjs
Full Member
***
Offline Offline

Activity: 185
Merit: 105


View Profile
June 01, 2018, 12:09:38 PM
 #13

Survey completed.
KaydenC
Sr. Member
****
Offline Offline

Activity: 546
Merit: 256


Developer of EscrowMyEther dApp


View Profile WWW
June 01, 2018, 12:39:19 PM
 #14

How important is pcie 3.0 4x bandwidth for this FPGA? Because I use 8 gpu riserless Onda mobos; the only way it'll connect is via m2 key host boards on Pcie 1x lanes.


I was looking at those mobos and wasn't sure about the size of the gap between GPUs. Is it sufficient for cooling, or do you use a fan to help?

I run them in server cases with 3500rpm fans. But they should be fine in a cooler climate with 1070 Ti or lower-end cards. Spacing is better than on the Colorful or Biostar riserless 8-GPU boards.

How important is pcie 3.0 4x bandwidth for this FPGA? Because I use 8 gpu riserless Onda mobos; the only way it'll connect is via m2 key host boards on Pcie 1x lanes.



Unfortunately if you don’t have any place with at least 4x PCIe 2.0 it will be difficult to use in the GPU hybrid role. You would be limited to standalone algorithms.

Noted, that's unfortunate.

53 Eth Escrowed, I offer escrow service: https://bitcointalk.org/index.php?topic=2221107.0
Buying >$250 Bitmain coupons
Shnikes101
Full Member
***
Offline Offline

Activity: 234
Merit: 107


View Profile
June 01, 2018, 01:48:54 PM
 #15

I'd be interested in a few and would be willing to join pre-order. Haven't rolled the dice in a while. This could be fun.

yrk1957
Member
**
Offline Offline

Activity: 196
Merit: 14


View Profile
June 01, 2018, 02:05:13 PM
 #16

I'll be interested in 4, whenever they are ready to order.
mo35
Member
**
Offline Offline

Activity: 124
Merit: 10


View Profile
June 01, 2018, 02:12:08 PM
 #17

Very interesting stuff, but some algo performance gains are must-have info, IMO.

s1gs3gv
Legendary
*
Offline Offline

Activity: 1260
Merit: 1014

ex uno plures


View Profile WWW
June 01, 2018, 02:23:33 PM
 #18

GPUHoarder
Member
**
Offline Offline

Activity: 112
Merit: 31


View Profile
June 01, 2018, 02:30:05 PM
 #19

- Will it increase the hashrate of GPUs that are on PCIe x1?

- Is there any limitation on how many GPUs it can handle, and is there any decrease in performance if more GPUs are used?

- If two of these are used, do we expect double the results?

Also, is there any mobo that has more than one PCIe x16 slot that runs at full x16 speed?



It is better to use fewer GPUs, as long as your PCIe bandwidth supports it. You must be able to have at least 4x PCIe 2.0 lanes of bandwidth to the GPUs you are accelerating. That is 8 GPUs in 1x, which is not ideal.

The maximum total lift per accelerator is normally around 30MH, as stated in the first post. There are some algorithms where it is 60. In all non-standalone cases the PCIe bandwidth becomes the bottleneck before accelerator performance.
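
To make the bandwidth requirement concrete, here is a hedged sketch that treats the hybrid path as limited by the slower of two sides: the accelerator's own link and the combined links of the GPUs it serves. This is one reading of the "at least 4x PCIe 2.0" rule above, not a published formula. The per-lane figures are nominal rates and the helper names are hypothetical.

```python
# Sketch: does a given slot layout meet the "at least 4x PCIe 2.0" guideline?
# Assumption (not from the thread): the guideline applies to both the
# accelerator link and the summed GPU links, using nominal per-lane rates.

LANE_BYTES_PER_S = {
    1: 250e6,                # Gen1
    2: 500e6,                # Gen2
    3: 8e9 * 128 / 130 / 8,  # Gen3
}


def bandwidth(gen, lanes):
    """Nominal one-direction bandwidth in bytes per second."""
    return LANE_BYTES_PER_S[gen] * lanes


# The threshold quoted in the thread, expressed in bytes per second.
MIN_HYBRID_BYTES_PER_S = bandwidth(2, 4)


def hybrid_path_bandwidth(accel_link, gpu_links):
    """Bandwidth of the slower side: accelerator link vs. summed GPU links."""
    gpu_side = sum(bandwidth(gen, lanes) for gen, lanes in gpu_links)
    return min(bandwidth(*accel_link), gpu_side)


def supports_hybrid(accel_link, gpu_links):
    """True if the slower side still reaches the 4x PCIe 2.0 threshold."""
    return hybrid_path_bandwidth(accel_link, gpu_links) >= MIN_HYBRID_BYTES_PER_S


if __name__ == "__main__":
    # Accelerator in a Gen3 x4 M.2 slot, two GPUs in Gen3 x16 slots.
    print(supports_hybrid((3, 4), [(3, 16), (3, 16)]))
    # Accelerator on a Gen2 x1 host board, eight GPUs on Gen2 x1 each.
    print(supports_hybrid((2, 1), [(2, 1)] * 8))
```

In the second layout the accelerator's own x1 link is the limiting side, which matches the earlier answer that such a rig would be restricted to standalone algorithms.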



badfad
Jr. Member
*
Offline Offline

Activity: 172
Merit: 4


View Profile
June 01, 2018, 03:52:17 PM
 #20

I did the survey, let's see now.