Bitcoin Forum
Author Topic: A RAM based fpga LTC miner  (Read 13847 times)
DeathAndTaxes
Donator
Legendary
*
Offline Offline

Activity: 1218
Merit: 1079


Gerald Davis


View Profile
August 31, 2013, 05:13:09 AM
 #41

Quote
all Bitcoin ASIC companies had to derive their works from existing the current SHA-2 ASICs, starting from scratch would have cost far more than the Bitcoin economy could have supplied, and far more than ASIC companies could afford to spend at their current price points.

That is not correct.  Bitcoin ASICs are essentially glorified SHA-2 calculators.  Input a binary blob & target.  Output any nonces which result in SHA-2(SHA-2(blob+nonce)) < target.  Not to take anything away from what the Bitcoin ASIC companies did, but the chips are just performing the "math" of the SHA-2 algorithm.  Most of the "smarts" is not in the custom ASICs but in the cheap microprocessor (Raspberry Pi or embedded computer).  Far more complex chips are made in universities every year as academic projects.  There was an ASIC in your hand calculator from the 1970s.  It didn't cost a billion dollars to design either, and the tools were a lot more primitive back then.
hendo420
Sr. Member
****
Offline Offline

Activity: 420
Merit: 250



View Profile WWW
August 31, 2013, 05:14:51 AM
 #42

I'm at the point where I have to read over my posts 30 times to make sure I spelled everything right, and should probably get some sleep.


ebmarket.co
hendo420
Sr. Member
****
Offline Offline

Activity: 420
Merit: 250



View Profile WWW
August 31, 2013, 05:17:24 AM
 #43

Quote
There was an ASIC in your hand calculator from the 1970s.  It didn't cost a billion dollars to design either.

Adjusting for inflation, it may have.

ebmarket.co
minerapia
Full Member
***
Offline Offline

Activity: 168
Merit: 100


View Profile
August 31, 2013, 05:17:39 AM
 #44

Quote
all Bitcoin ASIC companies had to derive their works from existing the current SHA-2 ASICs, starting from scratch would have cost far more than the Bitcoin economy could have supplied, and far more than ASIC companies could afford to spend at their current price points. Its like if current ASIC companies decided to start using 14nm chips. It would be unimaginably expensive to create a technology that doesn't exist yet, its far cheaper to modify existing designs.

Starting from scratch is equally expensive. FYI, your analogy is totally wrong. They didn't create or modify any 'technology' for their ASICs; it's a matter of coding and tools.
Next time, try using Google first before "I'd say its a pretty strong gut feeling."

Even reading a simple wiki article helps:
http://en.wikipedia.org/wiki/Application-specific_integrated_circuit

donations -> btc: 1M6yf45NskQxWXknkMTzQ8o6wShQcSY4EC
                   ltc: LeTpCd6cQL26Q1vjc9kJrTjjFMrPhrpv6j
digitalindustry
Hero Member
*****
Offline Offline

Activity: 798
Merit: 1000


‘Try to be nice’


View Profile WWW
August 31, 2013, 09:31:14 AM
 #45

But how does one account for the fact that ASIC companies are out there in the game and have invested to try to fill a market requirement? Most of the investment was the time up until now, so when you see the situation from this point of view, it's easy to see that they may move in that direction. I have no doubt that it won't be ABC123.

The more one thinks about LTC, the more one tends to start getting that little paranoid conspiratorial feeling.

Then one looks back on experience and one realizes that it's human nature to do this sort of thing.

Then one realizes that only through these events do dullards, in their own designs, help the whole.

- Twitter @Kolin_Quark
digitalindustry
Hero Member
*****
Offline Offline

Activity: 798
Merit: 1000


‘Try to be nice’


View Profile WWW
August 31, 2013, 09:35:34 AM
 #46

Ha ha, the downward price pressure could be our friends trying to diversify, ha ha.

Then, after the Wired write-up, to me Nova has never looked so good.

It could turn out that Nova is one of the most honest currencies around, after Nybble of course. I don't expect many to understand, of course, but those that do, do.

- Twitter @Kolin_Quark
divan0w
Newbie
*
Offline Offline

Activity: 43
Merit: 0


View Profile
August 31, 2013, 10:55:29 AM
 #47

So what now, VGA mining is finally over?
hope2907
Sr. Member
****
Offline Offline

Activity: 432
Merit: 250



View Profile
August 31, 2013, 11:09:28 AM
 #48

Yes, it is over.
ssvb
Newbie
*
Offline Offline

Activity: 39
Merit: 0


View Profile
August 31, 2013, 10:01:27 PM
 #49

Quote
LTC uses the parameters (2^10, 1, 1) which results in a token 128KB max scratchpad size.  That isn't a typo it is kilobytes.
You are just forgetting to multiply this scratchpad size by the number of "cores", "threads" or some other entities (what you call them depends on the underlying technology) in the miner device. All these "cores" are simultaneously doing hash calculations, each with its own scratchpad.

The reason why FPGAs and ASICs work so great for SHA-256 is that the number of gates needed for a single SHA-256 "core" is really small, so one can fit an enormous number of such cores on a single chip. But each scrypt "core" needs a scratchpad for storing intermediate data, and if the scratchpad is implemented as SRAM memory, then the number of gates per scrypt "core" just skyrockets. You can fit significantly fewer scrypt "cores" on a single chip than SHA-256 "cores".

There are some tricks for scratchpad size reduction (LOOKUP_GAP is the right keyword, you can search for it in the forum), but this reduction is not free and results in more computations. That's why you can see some people mentioning the space-time tradeoff. The optimal lookup-gap setup depends on the balance between the memory size/performance and the computational power available for arithmetic operations.

It is also possible to use external memory instead of on-chip SRAM, but the external memory must naturally have wide buses and a lot of bandwidth (memory latency is not critical for scrypt though). The scrypt GPU miners rely on GDDR5 speed, with a popular scratchpad size configuration being 64KB (lookup-gap=2), which indicates that the memory speed is the bottleneck and the excessive computational power is already traded off in order to reduce the burden on the memory.
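The sizes being discussed follow directly from scrypt's parameters; here is a minimal sketch (the helper function is mine, not from any miner's source):

```python
def scratchpad_bytes(n, r, lookup_gap=1):
    # scrypt's SMix stores N blocks of 128*r bytes each; a lookup gap of g
    # keeps only every g-th block and recomputes the rest on demand
    return 128 * r * n // lookup_gap

print(scratchpad_bytes(2**10, 1))                # 131072   -> LTC's 128 KB
print(scratchpad_bytes(2**14, 8))                # 16777216 -> the default 16 MB
print(scratchpad_bytes(2**10, 1, lookup_gap=2))  # 65536    -> the 64 KB GPU config
```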

I also suggest checking https://github.com/ckolivas/cgminer/blob/master/SCRYPT-README to find a lot of information, which is intended to be user-comprehensible:
"--lookup-gap
This tunes a compromise between ram usage and performance. Performance peaks at a gap of 2, but increasing the gap can save you some GPU ram, but almost always at the cost of significant loss of hashrate. Setting lookup gap overrides the default of 2, but cgminer will use the --shaders value to choose a thread-concurrency if you haven't chosen one.
SUMMARY: Don't touch this"


Quote
The default Scrypt parameters (2^14, 8, 1) result in a 16MB max scratchpad size roughly 128x as "memory hard".
The LTC scrypt parameters are sufficient for making sure that GPUs are required to have a lot of high-bandwidth memory for decent hashing speed. That's all that matters. Increasing the size of the scratchpad is not going to bring any improvements (if by improvements you mean making CPU mining more competitive). Actually, some scrypt-based cryptocurrencies tried to make it more "memory hard" and failed to really fend off the GPUs. Also, the 128x claim is just silly because you are forgetting that bigger scratchpads also inevitably mean more arithmetic operations involved in a single hash calculation. As I mentioned earlier, it is the balance between the memory speed and the arithmetic calculation speed that is important. And the LTC scrypt somehow managed to get it right, even if this actually happened unintentionally.

Regarding the FPGA device in the picture at the start of this topic: it looks like it is going to have external memory bandwidth roughly similar to what is available on triple-channel DDR3 systems. This is still less than the memory bandwidth of a mid-range GDDR5-equipped GPU. I doubt that this FPGA device is capable of demonstrating any mind-blowing hashing speed. Still, if it manages to scale well with the lookup-gap increase and have low power consumption and/or low device cost, then it might be competitive.

BTW, the appearance of competitive FPGA devices might make people more motivated to try better optimizing scrypt for AMD GPUs (squeeze every last bit of performance and/or reduce power consumption). Bring it on, this stuff may become fun again Wink
DeathAndTaxes
Donator
Legendary
*
Offline Offline

Activity: 1218
Merit: 1079


Gerald Davis


View Profile
August 31, 2013, 10:27:53 PM
Last edit: September 01, 2013, 05:19:40 AM by DeathAndTaxes
 #50

You used a lot of doublespeak.  First, I am aware of the space-time tradeoff; however, rather than explain it in every single post, it is useful to look at the max scratchpad size.  A 128KB scratchpad is going to require less memory and less bandwidth than a 16MB scratchpad regardless of what space-time tradeoff is employed.  A device only has a finite amount of computing power, and while you can trade time for space, needing less space to start with always helps.

As for a higher parameter value having no effect on the relative performance of CPUs, GPUs, and FPGAs/ASICs, that is just false.  Scrypt was designed to be GPU and specialized-device resistant.  This is important in password hashing, as most servers are using CPUs and an attacker will likely choose the most effective component for brute forcing.  By making CPU performance superior, it prevents attackers from gaining an advantage.  You can test this yourself.  Modify the cgminer OpenCL kernel to use a higher p value.  Around 2^14, relative GPU performance is essentially gone.  It is comparable to CPU throughput.  At 2^16, relative GPU performance is falling far behind.  At 2^20 the GPU never completes.

You say on one hand that the memory requirement doesn't matter and on the other hand that FPGAs are hard because they need lots of memory and wide buses.  Well, guess what: the higher the p value, the MORE memory and wider buses are needed.  At 2^14, roughly 128x the max scratchpad size is going to mean 128x as much bandwidth is necessary.  So the lower the p value, the EASIER the job is for FPGA and ASIC builders.  They can use less memory and narrower buses; that means less cost, less complexity, higher ROI%.  Sure, one isn't required to use the max scratchpad size because one can compute on the fly, but once again the whole point of the space-time tradeoff is that the advantage of doing so is reduced.

Lastly, yes, the 128KB is per core, but so is the 16MB using the default parameters.  If 128KB per core increases memory, bandwidth, and/or die size per core, then a 16MB requirement would make it even harder.  So yes, the parameters chosen by LTC make it 128x less memory hard than the default.  You use circular logic to say the max scratchpad size is irrelevant because one can optimize the size of the scratchpad to available resources.  This doesn't change the fact that due to the space-time tradeoff you aren't gaining relative performance.  Using a higher max scratchpad requires either more memory and bandwidth OR more computation.  The throughput on the FPGA, GPU, and CPU is going to be reduced.  Now, if they were all reduced equally it wouldn't matter; all that matters is relative, not nominal, performance.  However, the LTC parameters chosen are horrible for CPU usage.  CPUs have a limited ability for parallel execution, usually 4 or 8 independent cores.  128KB per core * 8 = 1MB.  That's right: today, with systems that can install multiple GB very cheaply, the scrypt parameters chosen bottleneck performance on a CPU.  GPUs, on the other hand, are highly parallel execution engines, but they have limited memory and that memory is at a higher cost than what CPUs have access to.



TL/DR
Whatever the relative performance of this FPGA is to a CPU miner, it would be WORSE if the p value were higher.  LTC's decision to use a low p value makes what otherwise would be a nearly impossible task into one which is merely challenging.
hasle2
Full Member
***
Offline Offline

Activity: 122
Merit: 100


View Profile
August 31, 2013, 11:53:47 PM
 #51

I wish I had the time to learn how to design these things. Looks like so much fun.
01BTC10
VIP
Hero Member
*
Offline Offline

Activity: 756
Merit: 503



View Profile
September 01, 2013, 12:01:00 AM
 #52

Waiting anxiously to read the specs on this  Cheesy
tacotime
Legendary
*
Offline Offline

Activity: 1484
Merit: 1005



View Profile
September 01, 2013, 02:12:38 AM
 #53

Quote
TL/DR
Whatever the relative performance of this FPGA is to a CPU miner, it would be WORSE if the p value were higher.  LTC's decision to use a low p value makes what otherwise would be a nearly impossible task into one which is merely challenging.

I doubt it.  You should have a look at Solar Designer's TMTO data for N=2^14, r=2^3, p=1.



As you can see, as memory exponentially decreases, integer ops exponentially increase.  He was easily able to get the memory usage into the kilobytes and still crank out hashes.  I'd guess that exactly the same is true with N=2^10, r=1, p=1 too.  It's the same balancing act you run into no matter what value you use for N or r; at higher N you may increase the difficulty by a smaller constant factor, but overall I doubt increasing N or r will make scrypt much more FPGA/ASIC-unfriendly when they finally iron out the FPGA implementation.
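A toy model of that balancing act (my own sketch, not Solar Designer's code) makes the shape of the curve plausible: storing only every g-th scratchpad entry divides memory by g, but each of the N random lookups then has to recompute (g-1)/2 entries on average.

```python
def tmto(n, lookup_gap):
    # time-memory tradeoff sketch for scrypt's second loop: fewer stored
    # entries means more BlockMix recomputation per random lookup
    memory_entries = n // lookup_gap
    blockmix_ops = n + n * (1 + (lookup_gap - 1) / 2)  # build pass + lookups
    return memory_entries, blockmix_ops

_, base_ops = tmto(2**14, 1)
_, small_ops = tmto(2**14, 128)   # squeeze the 16 MB-class state down ~128x
print(small_ops / base_ops)       # ~33x more BlockMix work
```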

Code:
XMR: 44GBHzv6ZyQdJkjqZje6KLZ3xSyN1hBSFAnLP6EAqJtCRVzMzZmeXTC2AHKDS9aEDTRKmo6a6o9r9j86pYfhCWDkKjbtcns
vnhyp0
Member
**
Offline Offline

Activity: 106
Merit: 10


View Profile
September 01, 2013, 02:27:40 AM
 #54

This looks like a potentially interesting development. Capitalism really drives innovation to extremes in some cases.

I look forward to more information about this project, beekeeper.

QQ: 2228207157
DeathAndTaxes
Donator
Legendary
*
Offline Offline

Activity: 1218
Merit: 1079


Gerald Davis


View Profile
September 01, 2013, 03:13:39 AM
Last edit: September 01, 2013, 05:24:16 AM by DeathAndTaxes
 #55

Quote
As you can see, as memory exponentially decreases integer ops exponentially increase.  He was easily able to get the memory usage into the kilobytes and still crank out hashes.  I'd guess that exactly the same is true with N=2^10, r=1, p=1 too.  It's the same balancing act you run into no matter what value you use for N or r; at higher N you may increase the difficulty by a smaller constant factor, but overall I doubt increasing N or r will make scrypt much more FPGA/ASIC-unfriendly when they finally iron out the FPGA implementation.

Yes, that is the space-time tradeoff, and they used it to reduce the memory requirements to roughly what LTC scrypt requires, EXCEPT doing so requires a ~100x increase in integer performance.  If anything, you just showed how weak LTC scrypt is.  Another way to look at it: say you had an FPGA card with an output of X kh/s using the full scratchpad size of 128KB.  Now, trying to run N=2^14 you don't have sufficient memory or bandwidth, but as the chart shows you could use the space-time tradeoff to reduce the memory requirement to 128KB.  Great, the memory requirement is similar to LTC scrypt ... EXCEPT you now need either an FPGA with 100x the integer performance (how much do you think that is going to increase the cost?) OR you are going to have 1/100th the hashrate.
ssvb
Newbie
*
Offline Offline

Activity: 39
Merit: 0


View Profile
September 01, 2013, 03:22:02 AM
 #56

Quote
You used a lot of double speak.
Nah, it's just you still having some trouble understanding Smiley

Quote
First I am aware of the space time tradeoff however rather than explain it in every single post it is useful to look at the max scratchpad size.   128KB scratchpad is going to require less memory and less bandwidth than a 16MB scratchpad if everything else is the same.
Let's have a look at the definition of what is "memory hard" in the scrypt paper: "A memory-hard algorithm is thus an algorithm which asymptotically uses almost as many memory locations as it uses operations; it can also be thought of as an algorithm which comes close to using the most memory possible for a given number of operations, since by treating memory addresses as keys to a hash table it is trivial to limit a Random Access Machine to an address space proportional to its running time", "Theorem 2. The function SMixr(B, N) can be computed in 4 * N * r applications of the Salsa20/8 core using 1024 * N * r + O(r) bits of storage"

You can see that scrypt is just equally memory hard for all the scratchpad sizes. The ratio between the number of scratchpad access operations and the number of salsa20/8 calculations remains the same.
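That constant ratio can be read straight off the Theorem 2 figures (a quick check; the numbers come from the paper, the helper is mine):

```python
# Theorem 2: SMix_r(B, N) takes 4*N*r Salsa20/8 applications and
# 1024*N*r bits of storage, so the ops-to-storage ratio does not
# depend on N or r -- both parameter sets are equally "memory hard"
def smix_cost(n, r):
    return 4 * n * r, 1024 * n * r  # (salsa20/8 calls, storage bits)

for n, r in [(2**10, 1), (2**14, 8)]:   # LTC vs default parameters
    calls, bits = smix_cost(n, r)
    print(calls / bits)                 # 0.00390625 both times
```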

Quote
As for higher parameter value having no effect on the relative performance of CPU, GPU, and FGPA/ASICs that is just false.
What I said was "Increasing the size of the scratchpad is not going to bring any improvements (if by improvements you mean making CPU mining more competitive)". How did it turn into "no effect on the relative performance of CPU, GPU, and FGPA/ASICs"? CPU miners are at a serious disadvantage right now, so the effect on the relative performance must be really significant in favour of CPU in order to turn the tables.

In practice, increasing the size of the scratchpad will make it harder to fit in CPU caches. To mitigate the unwanted latency of random accesses, scrypt uses the parameter 'r'. Basically, if r=1 (the default for LTC), then the scratchpad is accessed as 128-byte chunks at random locations. If r=8, then the memory accesses are done as 1024-byte chunks at random locations. In the former case, the cache miss penalty is hit once per 128 bytes. In the latter case, the cache miss penalty is hit once per 1024 bytes (the sequential accesses after the first cache miss are automatically prefetched, at least in theory). Having a high 'r' value reduces the effect of the memory access latency penalty for the CPU. And the latency is not an issue for the GPU in the first place.

Additionally, if the CPU has to access the memory, then the memory controller must have enough bandwidth. For example, my Core i7 860 processor currently has ~29 kHash/s performance in cpuminer. And the STREAM benchmark (built as multithreaded with OpenMP support) shows ~10GB/s of practically available memory bandwidth. These ~10GB/s of memory bandwidth would translate to a theoretical hard hashing speed limit of ~38 kHash/s if the CPU caches were not helping. There is not much headroom as far as I can see, and my processor does not even have AVX2.
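The ~38 kHash/s ceiling is simple arithmetic, assuming each hash writes the 128*r*N-byte scratchpad once and reads it back once (my own back-of-envelope, not a cpuminer formula):

```python
def bandwidth_limited_hashrate(bw_bytes_per_s, n, r):
    # per hash: one sequential fill of the scratchpad plus one pass of
    # random reads over it -> ~2 * 128 * r * N bytes of memory traffic
    traffic_per_hash = 2 * 128 * r * n
    return bw_bytes_per_s / traffic_per_hash

# ~10 GB/s of STREAM bandwidth, LTC parameters (N=2^10, r=1):
print(bandwidth_limited_hashrate(10e9, 2**10, 1) / 1000)  # ~38 kHash/s
```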

Quote
Scrypt was designed to be GPU and specialized device resistant.   This is important in password hashing as most servers are using CPU and attacker will likely choose the most effective component for brute forcing.  By making CPU performance superior it prevents attackers from gaining an advantage. You can test this yourself.  Modify cgminer OpenCL kernel to use a higher p value.  Around 2^14 GPU relative performance is essentially gone.  It is comparable to a CPU throughput.  At 2^16 GPU relative performance is falling far behind.
This is confusing, did you actually mean the 'N' value? Please just provide the patches for your changes to cgminer and cpuminer that you used for this comparison.

But in general, the GPU tuning is not easy because there are many parameters to tweak. Poorly selected configuration can result in poor hashing performance even for the LTC scrypt. You can find many requests for help with the configuration in the forum. So your poor performance report does not mean anything.

Quote
At 2^20 the GPU never completes.
And surely you can raise the memory requirements so high, that they would make mining problematic on the current generation of the video cards purely thanks to insufficient amount of GDDR5 memory. But guess what? In a year or so, the next generation of video cards will have more memory and suddenly GPU mining will again become seriously better than on the CPU. Designing the algorithm around some magic limits which may become ineffective at any time is not the best idea. The current "small" scratchpad size for scrypt focuses on memory bandwidth instead of relying on artificial limits such as memory size (which can be easily increased, especially in the custom built devices).

Quote
You say on one hand that the memory requirement doesn't matter and on the other hand that FPGA are hard because they need lots of memory and wide buses.  Well guess what the higher the p value the MORE memory and wider buses are needed.  At 2^14 roughly 128x the max scratchpad size is going to mean 128x as much bandwidth is necessary.
Yes, but only if also backed by roughly 128x more computational power. And likewise, the enormous computational power of FPGA/ASIC must be backed by a lot of memory bandwidth, otherwise it will be wasted.

Quote
So the lower the p value the EASIER the job is for FPGA and ASIC builders.  They can use less memory and narrower buses that means less cost, less complexity, higher ROI%.   Sure one isn't required to use max scratchpad size because one can compute on the fly but once again the whole point to the space-time tradeoff is that the advantage to doing so is reduced.
They can't use slower external memory, because it already needs to be damn fast.

Quote
Lastly yes the 128KB is per core but so is the 16MB using the default parameters.   If 128KB per core increases memory, bandwidth, and/or die size per core then a 16MB requirement would make it even harder.
Yes, the absolute hashing speed would just drop significantly with the 16MB scratchpad. But it would drop on CPU, GPU, FPGA or any other kind of mining device.

Quote
So yes the parameters chosen by LTC makes it 128x less memory hard than the default.
Sigh. Please just read the definition of "memory hard" in the scrypt paper.

Quote
You use circular logic to say the max scratchpad size is irrelevant because one can optimize the size of the scratchpad to available resources.  This doesn't change the fact that due to the space-time tradeoff you aren't gaining relative performance.  Using a higher max scratchpad requires either more memory and bandwidth OR requires more computation.  The throughput on the FPGA, GPU, CPU is going to be reduced.  Now if they were all reduced equally it wouldn't matter all that matters is relative not nominal performance.
Wait a second, where does this "reduced equally" come from? The space-time tradeoff just means that if you have a system with excessive computational power but slow memory, then you can still tweak lookup-gap to trade one for another. That is instead of being at a huge disadvantage compared to more optimally balanced system. This kinda "equalizes" the systems with vastly different specs, which is a total opposite of "reduces equally".

Quote
However the LTC parameters chosen are horrible for CPU usage.   CPU have a limited ability for parallel execution.  Usually 4 or 8 independent cores.   128KB per core * 8 = 1MB.
This just means that you don't know much about CPU mining. The point is that modern superscalar processors can execute more than one instruction per cycle; this is called instruction-level parallelism. Also there are instruction latencies to take care of. In order to fully utilize the CPU pipeline, each thread has to calculate multiple independent hashes in parallel. Right now cpuminer calculates 3 hashes at once per thread (or even 6 with AVX2). Now do the math.
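Doing that math with the numbers from this thread (a sketch; the cache-budget comparison is mine):

```python
threads = 8              # independent cores, per the post above
hashes_per_thread = 3    # cpuminer interleaves 3 hashes per thread (6 with AVX2)
scratchpad = 128 * 1024  # LTC scrypt: 128 KB per in-flight hash

working_set = threads * hashes_per_thread * scratchpad
print(working_set // 2**20)  # 3 (MB) -- well past a typical per-core L2 budget
```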

Quote
That's right today with systems that can install multiple GB for very cheap cost the scrypt parameters chosen bottleneck performance on a CPU.  GPU on the other hand are highly parallel execution engines but they have limited memory and that memory is at a higher cost than CPU have access to.
The memory must be also fast, not just large.

TL/DR

For the external memory, I'm assuming that sufficient size is available for as many cores as practically useful (in this case only the memory bandwidth is an important factor). For the on-chip SRAM memory, the bandwidth should not be a problem, as the memory can be tightly coupled with each scrypt core, but the size does matter and can't be made large enough (the CPU caches are really small compared with DDR memory modules for a reason). The current best-performing scrypt mining devices (AMD video cards) rely on external memory bandwidth. This FPGA design seems to be essentially a GPU clone.
FiiNALiZE
Hero Member
*****
Offline Offline

Activity: 868
Merit: 500

CryptoTalk.Org - Get Paid for every Post!


View Profile
September 01, 2013, 03:28:18 AM
 #57


 
antimattercrusader
Sr. Member
****
Offline Offline

Activity: 308
Merit: 250



View Profile
September 01, 2013, 04:10:13 AM
 #58



@BFL Josh

Can we pre-order these through BFL???


~BCX~


lmao. STFU and take my BTC LTC YAC!!!!

Why can't we pre-order a 10GH Scrypt-Jane unit through BFL at this time?

You see that 600 GH/s card? I think I'll pass... Still have not received my Jalapeno, but I bought a bunch of Block Erupters and a blade from http://www.wtcr.ca and got them the next day in the US.

BTC: 13WYhobWLHRMvBwXGq5ckEuUyuDPgMmHuK
digitalindustry
Hero Member
*****
Offline Offline

Activity: 798
Merit: 1000


‘Try to be nice’


View Profile WWW
September 01, 2013, 05:57:56 AM
 #59

So, cutting through all the "I'm smarter and understand better":

It is basically as I stated before: an ASIC will just be a more efficient version of a GPU system, as opposed to, say, a reorganization of the fundamentals.

I.e., an ASIC may provide from 4x to maybe 10x efficiency and less power/heat.

So therefore scrypt may be the domain of ASICs in the future.

And then if there were to be a next possible iteration, it would be out quicker.


BFL CEO

2 to 4 weeks ?

- Twitter @Kolin_Quark
YipYip
Hero Member
*****
Offline Offline

Activity: 574
Merit: 500



View Profile
September 01, 2013, 09:54:38 AM
 #60

Where's the XPM FPGA? Huh

...lolz

OBJECT NOT FOUND