Bitcoin Forum
November 11, 2024, 10:58:02 AM *
Author Topic: Cell/B.E. LTC Superminer Project  (Read 1701 times)
blasthash (OP)
November 07, 2013, 11:46:50 AM
Last edit: November 12, 2013, 08:49:21 AM by blasthash
 #1

Alright, guys, I'll lay this down.

It's been a while since I've been actively mining, for a number of reasons (mainly an iPhone app I'm in the process of writing).

I would like to work on a custom miner project centered on mining Litecoins, but as I'm still 'new' to cryptocurrencies, in the sense of not knowing how the P2P information channels work (and, let's face it, I'm not a cryptographer or cipher specialist, either), I would like some help. I'm a hardware engineer - hardware I understand, live, and breathe; software, not as much.

I will say this, and I'll bold it so after all the information y'all don't forget the potential gold in this: if anyone is interested in helping me with this project beyond simple advice and actually wants to participate/contribute/work to develop this idea, I will make all the hardware plans an open-source project if you can help me write the software side of things.

Obviously, Scrypt differs from SHA-256 in being memory-hard (strictly speaking, both are hash functions rather than ciphers); it revolves around memory-intensive operations that are exceedingly difficult to implement in ASICs or FPGAs at a price end-users would pay. Because of that, CPUs are generally used, since they have onboard 'L' caches as well as access to MUCH more memory.

Being a processor hardware designer primarily, I've wanted to mess with very-high-speed parallel arrays for a while, but couldn't find chips that met the specs I was looking for (face it, we're all picky in some regard), until I stumbled upon the PS3's chipset.

As you might know, the Cell processor used in the PS3 is a menace. It's a beast with 9 heterogeneous cores that runs at 3.2GHz, at power levels that make your i7s look like a money hole. Data on this processor is EXTREMELY hard to find if you're not a 6-figure-income engineer or a firm that specializes in server design, because while the PS3 was the impetus for its design, IBM is looking at it as a cash cow, and not through end-user sales. You can't buy these chips directly. You can't get a datasheet. You can't even get ballouts or pinouts for this thing.

I'm self-admittedly very butt-headed about my work, however, and won't take no for an answer when it comes to certain things, so I settled on a simple solution: reballed chips, or chips salvaged from defunct PS3s, of which eBay has no shortage. I've already reconstructed the package design from reballing stencils and photos, and I'm pulling the pinouts and schematic information off service manuals.

If you aren't at least intrigued by the implications of custom, open-source hardware involving this beast, you probably should be. The architecture and the IEEE texts show that this processor is particularly suited to high-speed cryptography and deciphering. With each chip nominally consuming under 20W, you could build an array of 20 of these and, if bussed correctly with external hardware, you'd have access to 150-180 physical cores for under 400W.

My idea is to put the main processors and necessary RAM on a 2-to-a-DIMM module configuration, with multiple sockets on a mainboard. That way, if one CPU goes under, it doesn't jeopardize the whole unit, and that way, modules (and thus, processing horsepower) can be added-on to increase throughput.

There's a LOT more to be done than that, but I feel it's a good operating concept.

So, here are my questions, although this isn't all of them:

1. How is data communication achieved? We treat mining clients like black boxes. That is, we don't worry about what is actually going on under the hood, or how the client pulls data from the server, processes it, and feeds it back. How is this process achieved? The first step in building, quite frankly, one bitchin' Bitcoin/Litecoin rig is figuring out how to get the data out of the clouds and into the hardware. It's clearly doable, because things like cgminer are around, although I've found that reading another person's source usually isn't helpful if you're not experienced with that person's coding methodologies. I find it to be like a signature. Can anyone shed light on how to accomplish this? I'm not afraid to admit I don't have much of a clue about what is actually going on here.

2. This may seem like a n00b question, but HOW is the process of 'mining' achieved? My cryptography/decryption exploits to date have centered around high-speed arrays designed to do nothing more than the good ol' brute-forcing. Obviously, with things such as SHA-256 and large-key-size algos, if you try to brute-force them, we'd all be here waiting on results until our bones turn to dust. How is a 'block' generated? What is the actual process of contributing one's part to the mining effort via a server?
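On question 1, here is a rough sketch of what a client like cgminer sends on the wire, assuming the pool speaks the Stratum protocol (newline-delimited JSON-RPC over a plain TCP socket). The worker name, job ID, and hex values are made-up placeholders, the `stratum_line` helper is hypothetical, and nothing here opens a network connection; it only builds the messages.

```python
import json

def stratum_line(msg_id, method, params):
    """Serialize one Stratum request: a JSON object terminated by a newline."""
    return json.dumps({"id": msg_id, "method": method, "params": params}) + "\n"

# 1. Ask the pool for a subscription (the reply carries the extranonce1 value).
subscribe = stratum_line(1, "mining.subscribe", ["superminer/0.1"])

# 2. Log in as a worker. The name and password are placeholders.
authorize = stratum_line(2, "mining.authorize", ["example.worker1", "x"])

# 3. The pool then pushes "mining.notify" messages describing the job
#    (previous block hash, coinbase parts, merkle branches, version, nbits, ntime).
#    When a nonce yields a hash under the share target, the client reports it.
#    Parameter order: worker name, job id, extranonce2, ntime, nonce.
submit = stratum_line(
    3,
    "mining.submit",
    ["example.worker1", "job-id", "00000000", "5280a1b2", "0001e240"],
)
```

In a hardware design like the one described, this is the part that would most naturally stay on an ordinary host computer, with only the header data and nonce ranges passed down to the chips.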

I'm not coming into this question unprepared, I've done some looking into it, but the easy-to-find answers are clear as mud.
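On question 2, mining really is brute force, just over a small field: the miner varies the 32-bit nonce at the end of an 80-byte block header and hashes each candidate until the result, read as a number, falls at or below a target. A minimal sketch for the Litecoin case follows, assuming scrypt with N=1024, r=1, p=1 and the header used as both password and salt. The all-zero header and the easy target are illustration values only, `scrypt_pow_hash` and `search_nonces` are hypothetical names, and `hashlib.scrypt` needs a Python build linked against OpenSSL 1.1 or later.

```python
import hashlib
import struct

def scrypt_pow_hash(header80: bytes) -> int:
    """Litecoin-style proof-of-work hash of an 80-byte block header.

    The header is both password and salt; the 32-byte result is read
    as a little-endian integer so it can be compared with the target.
    """
    digest = hashlib.scrypt(header80, salt=header80, n=1024, r=1, p=1, dklen=32)
    return int.from_bytes(digest, "little")

def search_nonces(header76: bytes, target: int, start: int, count: int):
    """Try `count` nonces from `start`; return the first that meets the target."""
    for nonce in range(start, start + count):
        header = header76 + struct.pack("<I", nonce)  # nonce is the last 4 bytes
        if scrypt_pow_hash(header) <= target:
            return nonce
    return None  # nothing found in this slice of the nonce space

# Dummy 76-byte prefix standing in for version, previous hash, merkle root, time, bits.
dummy_prefix = bytes(76)
easy_target = 1 << 252  # deliberately easy: roughly 1 hash in 16 qualifies
found = search_nonces(dummy_prefix, easy_target, start=0, count=2000)
```

A pool hands each miner a header template and a share target that is much easier than the network target, which is how individual contributions get measured.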

To operate these things at 100% of possible efficiency, at least part of it probably needs to be done in ASM (assembly) - that is, the actual algo processing needs to be in assembly so we are ABSOLUTELY sure we are getting the most optimized code for the device. Lot of use advanced microprocessors are if they're operating at less than their maximum efficiency.

Is there anyone who either can help me answer those questions or would want to turn this into a project? If I can get people to tackle this with me, we get it done faster, and we all win.
blasthash (OP)
November 10, 2013, 10:35:15 PM
Last edit: November 17, 2013, 09:16:41 AM by blasthash
 #2

PROJECT INFORMATION / STATUS

You can find the GitHub repo here.

Project donation fund address: LQ2L4gJR7onVD84N7jZqRezc7n16fjDUfN

***************** BUILDLOG *****************

[11/10/13]: Working on keyspace search mappings for each core, assuming one DIMM (two Cells) installed. With a 64-point vector pipe on each of the 16 SPEs, each vector point has at most 4,194,303 nonces to try (4,294,967,295 / (2 * 8 * 64)). Given even moderately optimized code, this means a valid result can ideally be found in under a second at 3.2GHz, at least for SHA-256.

[11/12/13]: Uploaded the first revision of a temp control sketch to the GitHub repo, for monitoring core temps using an Arduino. This will be revised in a later post to drive fans or pumps for water cooling, as well as possibly to report back and slow down core usage.

[11/17/13]: Got busted PS3; going to reflow it tomorrow to see if I can get it working and get into its guts. If so, I'll start writing code for it. If everything checks out, I'll greenlight the dual-processor modules for prototyping. Overclocking may also be experimented with - up to 4.0GHz.
cryptohunter
November 10, 2013, 10:52:48 PM
 #3

This does sound very interesting, i wasn't aware the ps3 cpu was so remarkable and efficient in terms of energy.  What cpu has the ps4 got?

blasthash (OP)
November 11, 2013, 02:49:06 AM
Last edit: November 11, 2013, 03:01:09 AM by blasthash
 #4


There's no official info as far as I know, but it seems like they're ditching the Cell/B.E., which is an utter waste because its development cost over $100M between Sony, IBM, and Toshiba.

It may be working on a GPU-inspired CPU. Nvidia has really been pushing the general-processing power of their CUDA frameworks lately. It may be that the new CPU expands on that concept.

And yes, the entire unit expends 10W or so at nominal frequency during normal operations. Constant vector ops may push it higher, but it stands to reason that the full array could cram in 20 chips (180 cores) for less than the total dissipation of a desktop PC.

I'm working on getting the Cell SDK from IBM as we speak. Optimization on this code to get the performance gains I conjectured about in the buildlog will have to be tight; but by all means doable.

To do some basic, approximate math, if a key-search were distributed across all values of the nonce, running 20 CPUs on an 8-SPU 64-point vector op pipe would give us:

4,294,967,295 / (20 * 8 * 64) = 419,430 + some operations for each vector point to find the result. Even with non-optimized code, the array could complete this in under a second. It could work even faster if one extended the vector processing somewhat to the PPE core.
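A quick check of that division, as a sketch: treating the nonce space as the full 2^32 values and taking the "64-point vector op pipe" at face value, it comes to about 419,430 nonces per vector point for the 20-chip array, and 4,194,304 for a single two-chip DIMM. The constant names are illustrative only.

```python
NONCE_SPACE = 2**32          # a block header carries a 32-bit nonce
SPUS_PER_CHIP = 8
VECTOR_POINTS = 64           # the "64-point vector op pipe" assumed in the post

def nonces_per_point(chips: int):
    """Return (whole nonces per vector point, nonces left over)."""
    points = chips * SPUS_PER_CHIP * VECTOR_POINTS
    return divmod(NONCE_SPACE, points)

print(nonces_per_point(20))  # (419430, 4096)  full 20-chip array
print(nonces_per_point(2))   # (4194304, 0)    one DIMM with two Cells
```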

Those are pretty remarkable results. I'll have to wait until I run some compile-tests on the actual code to get real timing (start -> final) information, but that's amazing in and of itself.

My idea is to have an FPGA that performs memory "injection" - that is, say a Spartan-6 LXT - a fast one - that injects the block to be run at a specific memory address. The CPUs would ideally be locked in a branch loop unless an interrupt signal was generated, in which case they would break out of the loop and start running the code.
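The inject-then-interrupt handshake can be illustrated with a purely software analogy. This is a sketch, not Cell or FPGA code: a dictionary stands in for RAM, a `threading.Event` stands in for the interrupt line, and `WORK_ADDRESS` is a hypothetical address chosen for the example.

```python
import threading

shared_ram = {}                      # stands in for the RAM the FPGA writes into
WORK_ADDRESS = 0x1000                # hypothetical injection address
interrupt = threading.Event()        # stands in for the interrupt line
results = []

def spe_core():
    """Idle until the 'interrupt' fires, then process whatever was injected."""
    interrupt.wait()                         # the branch loop, without burning CPU
    block = shared_ram[WORK_ADDRESS]         # read the work from the known address
    results.append(sum(block))               # placeholder for the real hashing

def fpga_inject(block: bytes):
    """Write the work first, then raise the interrupt, so the core never
    wakes up to a half-written block."""
    shared_ram[WORK_ADDRESS] = block
    interrupt.set()

core = threading.Thread(target=spe_core)
core.start()
fpga_inject(bytes([1, 2, 3]))
core.join()
print(results)   # [6]
```

The ordering inside `fpga_inject` is the point of the sketch: the data has to be fully in place before the signal is raised.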

It'll be a good bit to get to building the full 20-processor implementation, but with some R&D fund and resource management it won't be too bad.

Hope this gives you some good information.
DeathAndTaxes
Gerald Davis


November 11, 2013, 03:00:40 AM
 #5

The CPU on the PS4 (and Xbox One) is an 8-core "classic x86" CPU made by AMD. Using the Cell processor on the PS3 made porting games very difficult, and 3rd-party games are much more important. That often meant that, due to the Cell (which more than one game studio blasted as a nightmare to work on), PS3 ports would come much later.

http://www.anandtech.com/show/6976/amds-jaguar-architecture-the-cpu-powering-xbox-one-playstation-4-kabini-temash
blasthash (OP)
November 11, 2013, 03:03:42 AM
 #6


Yeah. The Cell is admittedly a monster to try and code for, mainly due to the fact that the cores are heterogeneous. One is a PowerPC engine (PPE); eight are SPEs that use a completely different assembly and end-compiler. It's hard to get the code to cooperate, at least from the standpoint of highly-integrated software (i.e. games).
bee7
November 11, 2013, 03:24:37 AM
Last edit: November 11, 2013, 03:52:59 AM by bee7
 #7

Interesting.

However, the main bottleneck of scrypt is RAM access. If the CPU you choose has a small amount of cache memory, then you are bound by the RAM throughput anyway. Let's assume you are using DDR3-2133 SDRAM. Then the throughput is 2133*10^6*(data bus width in bytes) bytes/sec. Scrypt requires 128 KB of data to be transferred back and forth for each calculated hash. That means that with a 64-bit bus you may achieve 2133*10^6*8/(128*1024*2) = 65093 hashes a sec. If the CPU has a 32-bit data bus, divide this number by 2. This is the theoretical maximum. The actual performance would be lower: to perform the necessary calculations, even in parallel with these memory reads and writes, the CPU must execute 90 651 115 520 assembler instructions per second on 32-bit words to reach those 65093 hashes a sec (if it has no provisions for parallel computation). And this estimate does not include the necessary SHA calculations.
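The bandwidth bound above can be reproduced in a few lines. This is a sketch of the same arithmetic, assuming one full write pass and one full read pass of the 128 KB scratchpad per hash; `max_hashes_per_sec` is a name made up for the example.

```python
def max_hashes_per_sec(transfers_per_sec: int, bus_bytes: int,
                       scratchpad_bytes: int = 128 * 1024) -> int:
    """Upper bound on scrypt hashes/sec when every hash must write and then
    read back its whole scratchpad over the external memory bus."""
    bandwidth = transfers_per_sec * bus_bytes      # bytes per second
    traffic_per_hash = scratchpad_bytes * 2        # one write pass + one read pass
    return bandwidth // traffic_per_hash

ddr3_2133 = 2133 * 10**6                           # transfers per second

print(max_hashes_per_sec(ddr3_2133, bus_bytes=8))  # 65093  (64-bit bus)
print(max_hashes_per_sec(ddr3_2133, bus_bytes=4))  # 32546  (32-bit bus)

# The instruction figure quoted above works out to this many per hash:
print(90_651_115_520 // 65_093)                    # 1392640
```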

Now, let's look at what happens if we use a powerful CPU that has several cores and a big internal cache - big enough to keep 128 KB of data per core without touching external RAM. For example, the Intel Core 2 Quad Q9550 gets 32.2 Kh/s (source: https://litecoin.info/Mining_hardware_comparison#Intel).

Thoughts?
blasthash (OP)
November 11, 2013, 07:50:12 AM
 #8

First of all, just to point out, we're talking about one specific CPU in question here.

The Cell CPU has enough independent on-chip memory (a 256 KB local store on each SPE, rather than an L cache as such) to hold the data on each SPU. External RAM would still be present for load/store instructions, to give a safety buffer. At present, I plan on having an FPGA 'inject' the block and other data to be hashed into the RAM at a specific point. DRAM contents do expire, but on a time frame we won't be worried about. That FPGA will then trigger an interrupt that allows the CPU/SPE core to break out of a branching loop and start reading code from that address. With the vector units we have access to the full bore of 128 general-purpose registers, each 128 bits wide. The way the code is looking now, even in a highly brutish and non-optimized state, we will not need expensive memory stores or loads aside from at the start and finish.
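As a sanity check on the claim that the data fits on each SPU, here is a small sketch comparing the scrypt scratchpad size with the 256 KB per-SPE local store. It assumes the Litecoin parameters N=1024 and r=1, and it ignores how much of the local store the program code and buffers would need.

```python
def scrypt_scratchpad_bytes(n: int, r: int) -> int:
    """Size of scrypt's V array: N blocks of 128*r bytes each."""
    return 128 * r * n

LOCAL_STORE = 256 * 1024                             # bytes per SPE
scratchpad = scrypt_scratchpad_bytes(n=1024, r=1)    # assumed Litecoin parameters

print(scratchpad)                 # 131072
print(LOCAL_STORE - scratchpad)   # 131072 left for code, stack, and buffers
```

So one hash at a time fits per SPE, with half the local store to spare, but two full scratchpads would not fit alongside any code.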

Obviously the FPGA will be running much slower than the SPE cores, and so it won't be able to respond to the pickup load off RAM as fast as the CPU can issue it. But all things considered, I think it's still a highly-optimizable process.
blasthash (OP)
November 12, 2013, 08:49:10 AM
 #9

For anyone who's keeping tabs on this project, I've changed the name of the thread topic to better reflect what the purpose of this is.
digitalindustry
November 12, 2013, 10:55:33 AM
 #10

Sounds interesting, keep us posted. I love this shit.

Don't know if I can help, I need to find out. Looks like a hella fun project, though.

- Twitter @Kolin_Quark
SpeedDemon13
November 12, 2013, 11:05:00 AM
 #11

The theory is sound. I'll keep an eye on this project. How much does the Cell processor go for, per unit or in bulk?

CRYPTSY exchange: https://www.cryptsy.com/users/register?refid=9017 BURST= BURST-TE3W-CFGH-7343-6VM6R BTC=1CNsqGUR9YJNrhydQZnUPbaDv6h4uaYCHv ETH=0x144bc9fe471d3c71d8e09d58060d78661b1d4f32 SHF=0x13a0a2cb0d55eca975cf2d97015f7d580ce52d85 EXP=0xd71921dca837e415a58ca0d6dd2223cc84e0ea2f SC=6bdf9d12a983fed6723abad91a39be4f95d227f9bdb0490de3b8e5d45357f63d564638b1bd71 CLAMS=xGVTdM9EJpNBCYAjHFVxuZGcqvoL22nP6f SOIL=0x8b5c989bc931c0769a50ecaf9ffe490c67cb5911
blasthash (OP)
November 12, 2013, 12:37:27 PM
 #12


That's one of the principal issues - the source of units themselves.

Right now the only practical route is to buy busted PS3s for around the $50 range and flow off the Cells. Keep in mind that a quad-core i7 is about $150 for the laptop variety, so if nothing else, it's still not a bad deal. Test and resell the PSUs, drives and Wi-Fi / BT modules, and with any luck, the processor might end up costing nothing.

Problem is, a malfunctioning PS3 doesn't exactly have a steady price, but for R&D purposes the second-hand route works fine. If this project can produce a functioning prototype, I may start a crowd-sourced Kickstarter to finance a small contract with IBM for the chips - they're keeping these things locked down TIGHT.

A slight bonus for members of the forum: once I get the reflow chip salvage up and running, I'll sell the mobos and some of the RSXs (essentially an Nvidia GPU) for experimenting/parts replacement/wall decoration, and for cheap.
SpeedDemon13
November 12, 2013, 01:22:08 PM
 #13


True, the price of a broken PS3 can vary with each seller's personal valuation. On craigslist, I've seen them as low as $15 and as high as $100, but on average between $20 and $50.

blasthash (OP)
November 12, 2013, 01:32:24 PM
 #14

I'll give everyone a bit more information to chew on, then I'm calling it quits for the night, otherwise my brain will be fried from thinking too much over multi-core architectures.

Methods of constructing the Superminer at the macroscopic (inter-processor) level:

1. SC/DC (Smart Computer, Dumb Chips) - Master computer (standard PC on Linux distro, etc.) commands FPGAs which 'hot-inject' block data into RAM, essentially 'force-feeding' the processors. Each member of the array depends on the total number of processors to determine which part of the noncespace it should work on. Least nasty code, and a UART data push to the FPGAs is simple to achieve at a high level. Code would have to be altered if units were added or removed, unless the code takes on the extra complexity of auto-detection.

2. Homogeneous master node - One Cell unit boots a full Linux OS on its PPE core, and tells its SPEs and the other processor units what task to carry out. More nasty code, but the array is not dependent on the number of processors present - that is, the code will be the same for 2 or 10 CPUs present.

3. "Democratic" node setup - All nodes (CPUs) arrayed in a headless server configuration. This takes the least work, and the least amount of inter-processor communication, at the expense of being the closest in complexity to a bunch of PS3s mounted on custom PCBs.

Personally, while I endorse the first one the most as I feel it is the 'easiest' method to implement without being a cheap-shot, the second method allows for quite possibly the most optimized results. The only issue here is that we would have to have more information on how interprocessor communication functions.
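For the first method, the master computer's job of deciding each processor's slice of the noncespace could look roughly like the sketch below. It assumes the master simply recomputes the split whenever the number of installed processors changes; `nonce_shares` is a made-up helper, not part of any existing miner.

```python
def nonce_shares(num_workers: int, space: int = 2**32):
    """Split [0, space) into contiguous, non-overlapping half-open ranges,
    one per worker, with sizes differing by at most one nonce."""
    base, extra = divmod(space, num_workers)
    shares, start = [], 0
    for i in range(num_workers):
        size = base + (1 if i < extra else 0)   # first `extra` workers take one more
        shares.append((start, start + size))
        start += size
    return shares

# Recomputing on the master when the array grows from 2 to 3 chips:
print(nonce_shares(2))
print(nonce_shares(3))
```

Because the split is a pure function of the processor count, adding or removing a module only means calling it again with a new number, which is the auto-detection cost mentioned in the first option.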

This is where we need more help. The southbridge for the Cell processor has PCI endpoints, but we need to know how it communicates and reads over that bus. This is a definite candidate for interprocessor communication, and it saves the processor chain from having to package and publish the data over a LAN connection like it would for GbE communication. If any of you guys know how such a communication scheme would translate into code, you'd be much wanted in the project circle.

Also, if you check the second post (the project information/status post), I've added a GitHub repo link as well as a LTC address that is earmarked for project research. Anything helps, and even if you can't contribute knowledge, you'll still be able to know you helped get the Superminer on two legs.
blasthash (OP)
November 12, 2013, 01:33:30 PM
 #15


Exactly. I've already snagged one for $50 or so on eBay that I'm going to use to see if I can fix it and get it working as a standalone miner in the meantime, and if not, pure R&D and chip snag.
SpeedDemon13
November 12, 2013, 01:38:33 PM
 #16


The first option sounds like the most doable in the short term. The second is more streamlined than the first. The last one would be the ideal one in the long run.

blasthash (OP)
November 12, 2013, 01:59:30 PM
 #17


My thoughts exactly. It stands to reason that if I get creative with the implementation, the hardware would need to see minimal changeover between the various iterations. Currently the FPGA I'm considering is a Spartan-6 LXT150, which would have enough I/O to pump 6 to 8 different cores with enough data.

What I was originally thinking was flashing each processor's BIOS chip (on the principle that the BIOS flash exists at address 0x0, the first address) with the assembly code for the operation, with each processor's share of the noncespace coded into the assembly. While this is a simple route, it is the most expensive in terms of upgrades - each processor would have to be reflashed if the system config changed. If we get smart with the FPGA implementation and the code on the CPUs, though, we could have the FPGA 'prime' each processor with its share prior to work.

IBM has been nice enough to take the approach that it won't even supply datasheets without a design consultation for custom specs; as this is almost certainly out of the price scope of this project, it isn't being pursued. Right now I have two things on the chopping block:

a) Reconstruct the ballout accurately from leaked Sony service manuals;

b) You'll also notice if you take a look that there is a separate I/O channel. This is almost certainly MMIO; but without a datasheet we don't know what address each pin stands at. Until that can be tested, the only way I can think of doing this is to 'hotwire' the communications FPGA into the RAM addressing. This comes at the cost of RAM real-estate, but as store and load instructions are highly-latent instructions, ideally this processing would be using primarily only the caches in the first place.
cryptohunter
November 13, 2013, 03:40:50 AM
 #18

Would the ps3 cpu be any use for scrypt jane coins? the efficiency of this chip seems remarkable.

blasthash (OP)
November 13, 2013, 04:33:54 AM
 #19


Ideally, it'd be useable for any Scrypt protocol (and routine SHA-256 as well, albeit at much less efficiency (hashes per watt) than GPUs).

The efficiency is what I find to be impressive. Under full load, the PS3 slim consumes right around 70W. If we assume roughly 95% conversion for the power supply, then 67W is the draw to the board, and allowances for the GPU at roughly half plus peripherals puts power consumption of the Cell in most applications below 25W.

If we can take this project to production, it would not be infeasible to sell two models of the Superminer, a 15- or 20-core variety and a 40-core menace. I'll have to see some code analytics to know what we can reach in terms of per-chip hashrate, but knowing that a while ago someone developed a Scrypt miner for a Linux PS3 and managed to eke out 34 kH/s using 6 SPEs, probably without using vector instructions to their fullest, you can see the headroom that such a device could allow for.
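The power and hashrate figures above can be put side by side in a short back-of-envelope sketch. The 20-chip projection assumes purely linear scaling from the reported 6-SPE result and reads the quoted figure as kH/s; both are assumptions of this sketch, not measurements.

```python
wall_power_w = 70.0                       # PS3 slim under full load
board_power_w = wall_power_w * 0.95       # assumed 95% PSU conversion
gpu_power_w = board_power_w / 2           # "roughly half" goes to the GPU
cell_plus_peripherals_w = board_power_w - gpu_power_w

print(round(board_power_w, 2))            # 66.5
print(round(cell_plus_peripherals_w, 2))  # 33.25, shared by the Cell and peripherals

reported_khs = 34.0                       # reported Linux PS3 miner using 6 SPEs
per_spe_khs = reported_khs / 6
array_khs = per_spe_khs * 8 * 20          # naive linear scaling: 8 SPEs x 20 chips

print(round(per_spe_khs, 2))              # 5.67
print(round(array_khs, 1))                # 906.7
```

On these numbers, a Cell budget below 25W implies that the peripherals account for a little over 8W of the board draw.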

For the first R&D build I'm going to really showcase this thing - it needs to look the part.

When I get to building, major miner-pr0n will result.
blasthash (OP)
November 13, 2013, 06:24:21 AM
 #20

Couldn't help myself, had to mock up a 19" rack panel for fun.

Let me know what you guys think. I'm probably going to design the chassis around a blue glow theme. The 6 slots are for 7-segment readouts, the I/O connectors are panel-mount Neutriks, and we have a big red emergency start button for fun.

This panel is designed for if I decide to watercool the unit. The radiator and fans are moved up front for heat in the winter and because the front of the rig is usually the least impeded (if it's on a desk or in a rack, the rear plate will be up against a wall or enclosure panel; we want efficiency - since when have you put a computer against the wall facing the wrong way?).

Again, let me know what you guys think.

The upload is of bad quality but I think it looks pretty snazzy.

http://i1272.photobucket.com/albums/y384/blasthash/CellBE_Supercomputer-1_zps2c89570b.jpg