Bitcoin Forum - Show Posts
1  Bitcoin / Mining speculation / Re: BFL Labs just like Get-rich-quick-schemes on: July 14, 2012, 11:23:33 PM
Quote
1) Make incredible claims about your future product; promise 1-to-1 trade-ins on your currently sold product
2) Point all your PR tools at talking favorably about your future product and minimizing your previous failure to meet pre-release numbers
3) Take people's $ for future orders; this, combined with money not spent due to uncertainty, reduces your competitors' sales and thus the funds available to develop their future products
4) Some profit
5) Plow some of that money into developing something that might at least come close to your seat-of-the-pants PR numbers
6)
7) More profit

What about the following:

1) Make incredible claims about your future product; promise 1-to-1 trade-ins on your currently sold product
2) Point all your PR tools at talking favorably about your future product and minimizing your previous failure to meet pre-release numbers
3) Take people's BTC for future orders; this, combined with money not spent due to uncertainty, reduces your competitors' sales and thus the funds available to develop their future products
4) Some profit
5) Do not plow any of that money into developing something that might at least come close to your seat-of-the-pants PR numbers
6) Declare bankruptcy
7) Return the money to everyone that paid by wire transfer. Fuck over all the greedy idiots stupid enough to preorder in BTC.
8) Still some profit
9) Move to some nice sunny place and enjoy.

2  Other / Off-topic / Re: Dual use ASICs, Mining and Cracking on: July 05, 2012, 08:25:10 AM
Of course it is possible to design a dual-purpose (SHA256 cracking/mining) chip, but it would very likely mean a larger die size and more layers, thus higher NRE costs and higher end-product costs.

The end result is customers paying more to get the same hashrate. There isn't that much demand for SHA256 cracking anyway; SHA256 is not widely used for password hashing, so it is highly unlikely you'd ever pay off your ASIC miner in case bitcoin collapses, simply because demand is low. A better idea would be to create a dual-use chip that can perform a custom number of PBKDF2 iterations; it could then easily be reused by software to crack different things (WPA-PSK, ZIP 3.x, protected OpenOffice docs, FileVault, encrypted iOS/BlackBerry backups, etc). The problem is that PBKDF2-SHA1 is so radically different from SHA256 that you'd be much better off selling two different boards together.
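To make the configurable-iteration idea concrete, here is a toy C sketch of the PBKDF2 "F" shape (U1 = PRF(P, S), U_i = PRF(P, U_{i-1}), all XORed together). The PRF below is a made-up stand-in, not HMAC-SHA1; the point is only that the iteration count is a runtime parameter, which is exactly what a dual-use chip would expose.

```c
#include <stdint.h>

/* Toy stand-in for the PRF (NOT a real HMAC); purely illustrative. */
static uint32_t toy_prf(uint32_t key, uint32_t data) {
    uint32_t x = data ^ key;
    return ((x << 7) | (x >> 25)) + key;   /* rotl(x,7) + key */
}

/* The shape of PBKDF2's F function: U1 = PRF(P, S), U_i = PRF(P, U_{i-1}),
 * output = U1 ^ U2 ^ ... ^ Uc.  The iteration count c is a runtime
 * parameter, which is what makes a variable-iteration chip reusable. */
uint32_t toy_pbkdf2_f(uint32_t password, uint32_t salt, unsigned c) {
    uint32_t u = toy_prf(password, salt);
    uint32_t out = u;
    for (unsigned i = 1; i < c; i++) {
        u = toy_prf(password, u);
        out ^= u;
    }
    return out;
}
```

A software cracker targeting WPA-PSK would drive the same core with c = 4096, a FileVault cracker with its own count, and so on.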

Overall though, fast ASICs are not very useful for hash cracking unless you are a three-letter agency that can afford producing lots of different designs. Also, unlike bitcoin mining, it is not all about speed: cleverly exploiting low password entropy by reducing the keyspace generally works better (otherwise cracked passwords of length >= 10 would be extremely uncommon). Simple bruteforce attacks are dumb, and it's usually much more practical to write some good wordlist-mangling rules than to bruteforce. Even statistical attacks like those based on Markov models are generally much better than bruteforcing. But Markov attacks require good input models, and rule mangling requires much more skill than bitcoin mining does.
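As an aside, wordlist mangling rules are just cheap string transforms applied to dictionary words. A minimal C sketch of two common rules (the semantics mirror typical crackers, but the function names here are mine):

```c
#include <string.h>
#include <ctype.h>

/* Rule "c": capitalize the first letter of the candidate word. */
void rule_capitalize(const char *in, char *out) {
    strcpy(out, in);
    if (out[0]) out[0] = (char)toupper((unsigned char)out[0]);
}

/* Rule "$d": append a digit to the candidate word. */
void rule_append_digit(const char *in, char *out, char digit) {
    size_t n = strlen(in);
    memcpy(out, in, n);
    out[n] = digit;
    out[n + 1] = '\0';
}
```

Chaining a handful of such rules over a good wordlist covers the bulk of human-chosen passwords at a tiny fraction of the cost of exhaustive search.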
3  Other / Off-topic / Re: Dual use ASICs, Mining and Cracking on: July 04, 2012, 11:00:03 PM
Not with an ASIC; not even for sha256(sha256(password))-type hashes, and not even for the simplest single-hash case, as opposed to the hashlist scenario, which is much more complicated. Bitcoin mining is different from hash cracking: inputs follow a certain pattern, so the miner doesn't have to worry about candidate generation, full-length hash comparisons, or the more complex cases involving comparisons against lists of hashes, where hash bitmaps or a similar technique must be employed for early rejects in order to reduce the SLOW host-device transfers.

Sorry, you can't profit from mining ASICs in case something generally f***s up.

 
4  Bitcoin / Mining / Re: Intel 50-core Knights Corner on: June 12, 2012, 09:24:23 PM
Quote
Well this is quite dishonest, you're claiming one scrypt operation can keep each KC core utilized? I doubt it. It will probably require more than 4 scrypt operations per KC core to keep the hardware utilized, so the memory shoots up to 12.8 GB... also outside practical limits for an add-on board.

It can. In the CPU world you don't hide latencies by scheduling other threads on a core when a memory-bound thread is stalled on a memory access; quite the opposite, context switches are expensive (hyperthreading being a special exception, but there you have two register sets per core and things are different). Since you are a "1337 c0d3r", I assume you've written a compute-intensive multithreaded application at some point. Did increasing the thread count beyond the number of CPUs improve performance because you somehow "utilized" the cores better? Or did just the opposite happen, because all you did was introduce scheduling contention?

Quote
You can decouple the fetching and decoding from the execution. Instructions do not execute until they are ready.

Again, pipelines can't help much when an instruction depends on the result of a previous one. SHA256 has 64 steps, and each step depends on the result of the previous one. There are a number of independent instructions within each step, but that parallelism is limited.
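A toy C sketch of why this matters (the round function below is a simplified stand-in, not real SHA256, which updates 8 state words a..h): within a step, t1 and t2 are independent and can issue in parallel, but step i+1 consumes the a and e produced by step i, so the 64 steps form a serial chain no matter how many ALUs you have.

```c
#include <stdint.h>

static uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}

/* Simplified skeleton of one SHA-256-style round step (2-word toy state).
 * t1 and t2 are independent instructions WITHIN a step, but the next
 * step cannot start before this one finishes, because it consumes the
 * freshly written a and e. */
void toy_round(uint32_t *a, uint32_t *e, uint32_t w) {
    uint32_t t1 = *e + rotr32(*a, 2) + w;  /* depends on current a, e */
    uint32_t t2 = rotr32(*e, 6) ^ *a;      /* independent of t1 */
    *e = *a + t1;
    *a = t1 + t2;
}
```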

Quote
This is a disadvantage of VLIW4/5 and has nothing to do with GPGPU.

Inadequate register/shared memory is a disadvantage of any GPU, not only VLIW ones. It makes them much less suitable for memory-intensive algorithms, even embarrassingly parallel ones. Moreover, resource-limited occupancy is a general GPGPU problem, far from being bitcoin-related or VLIW-related; it is a problem even for ALU-bound kernels like the bitcoin one.


Quote
Of course you would increase the GPR if you increase the number of ALU per CU. But note that a CU generates a structural dependency, a dependency that we created in order to accommodate GPGPU.

What makes you think you would increase the register count if you increase the number of ALU units? I see... hmm... no relation between the two.


Quote
If you were to make an ASIC miner, you sure as hell dont need a crapton of GPRs or CUs or "wavefront"... all you would need is tons and tons of ALUs.

Care to elaborate on what "ALU" means in terms of an ASIC?
5  Bitcoin / Mining / Re: Intel 50-core Knights Corner on: June 12, 2012, 07:31:42 PM
Quote
Wouldn't the same exact restrictions on GPU applies to KC then? In fact it would pretty much apply to all large parallel architectures.

A modern-day GPU like the 7970 has 2048 SPs, and since you need at least 4 wavefronts/CU to keep the hardware utilized, you should schedule the kernel with an NDRange of at least 8192. That makes 8192 x 16 MB = 131072 MB of video memory, far beyond what a 7970 actually has. KC has 50 cores; assuming you use SSE2 registers (since AVX does not have integer arithmetic on ymms, that comes with AVX2), you end up with the equivalent of 200 GPU workitems, or 200 x 16 MB = 3200 MB of memory, quite within practical limits.
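For reference, the back-of-the-envelope arithmetic above, assuming 16 MB of scratchpad per in-flight scrypt instance:

```c
/* Scrypt scratchpad sizing: 16 MB of state per in-flight instance
 * (default scrypt parameters).  Returns megabytes. */
unsigned long scrypt_scratch_mb(unsigned long in_flight) {
    return in_flight * 16UL;
}
```

scrypt_scratch_mb(8192) gives 131072 MB (the ~131 GB figure for the GPU case), while scrypt_scratch_mb(200) gives the 3200 MB figure for KC.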


Quote
Just what do you think bitcoin mining is?

Bitcoin mining by definition has nothing to do with parallel computing. You could do it on a slow single-core MIPS processor, for example; of course it would not be cost-effective.


Quote
You clearly do not understand how computer architecture or dependency resolution works. Waiting on the results of a previous instruction? Just start work on a non-dependent instruction.

Provided that there is a non-dependent instruction to fetch/decode/execute.


Quote
The current bitcoin mining hardware is restricted by the number of ALUs. True there might be a point where you need to increase the registers, decoders, scheduling to feed all those ALUs, but that's not until you increase the ALU resources drastically from where they are now.

Of course it is limited by the number of ALU units, but that's not the only limitation. It is also limited by occupancy, and occupancy is limited by GPR usage. Even a well-optimized kernel uses enough GPRs to limit the number of wavefronts/CU. That's why, for example, 2-component vectors worked best on VLIW hardware: uint2 does not provide enough independent instructions to fill the VLIW4/VLIW5 bundles, so ALUPacking was far from 100%, but going to, say, uint4, while improving ALUPacking, ironically worsens performance because it requires more GPRs, so fewer wavefronts can be scheduled, occupancy drops, and the hardware is underutilized. AMD could make their hardware much better suited to bitcoin mining (and not only that) if they increased their GPUs' register file, but they decided the current size was enough. Generally, yes, more ALUs would make the hardware faster, but more GPRs per CU would definitely make it faster too. There isn't much use in more ALUs if you can't keep them busy. Right now, bitcoin kernels are a compromise and the hardware is never completely utilized.

An even better example is NVidia's Kepler. The 680GTX has 1536 ALUs, three times more than a fast Fermi GPU like the 580GTX. Yet practical results show the 580 is faster than the 680 at bitcoin mining (and any other ALU-intensive GPGPU work, for that matter). The reason? They went from grouping 32 cores per CU to 192, but instead of scaling the register file up 6x they only scaled it 3x, so you end up with half the registers per core you had with Fermi. The result is that you can't get proper occupancy, and alas, the 3x increase in ALUs is practically money for nothing. Kepler is not a GPGPU arch, and GK110 unfortunately does not diverge from that.
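The register-file occupancy argument can be sketched as a toy model (all numbers fed to it below are illustrative, not the real values of any specific GPU):

```c
/* Illustrative occupancy model: the number of wavefronts a CU can keep
 * in flight is capped by how many copies of the kernel's registers fit
 * in the register file, and by a hardware maximum. */
unsigned waves_per_cu(unsigned regfile_per_cu,   /* total GPRs per CU */
                      unsigned threads_per_wave, /* e.g. 64 on AMD */
                      unsigned gprs_per_thread,  /* kernel's GPR usage */
                      unsigned hw_max_waves) {
    unsigned limit = regfile_per_cu / (threads_per_wave * gprs_per_thread);
    return limit < hw_max_waves ? limit : hw_max_waves;
}
```

With a made-up 16384-register file and 64-thread waves, a lean 8-GPR kernel hits the hardware cap, while pushing GPR usage to 40 drops the CU to 6 wavefronts: more ALU work per thread, fewer threads to hide latency with.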


6  Bitcoin / Mining / Re: Intel 50-core Knights Corner on: June 12, 2012, 02:46:22 PM
Actually, litecoin is dependent on cache speed (see this page), so depending on the cache organization in Knights Corner, it might not even end up being faster than a regular CPU.

scrypt is a sequential memory-hard algorithm and depends (mostly) on memory access speed. Since the memory requirements are enormous (16MB per single scrypt operation with the default parameters) and the access pattern is random, CPU caches don't help a lot. They help much more with bcrypt, where the state is just 4KB and fits in L1 cache completely. scrypt is unimplementable on GPUs for other reasons, mostly that there is just not enough memory: assuming you can utilize ~2GB of video RAM, that would feed 128 workitems, or 2 wavefronts, which is quite inadequate to keep even one of the CUs properly utilized and definitely not enough to hide any memory latency.
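A toy C sketch of the structure that makes scrypt sequential and memory-hard (the ROMix shape from the scrypt paper, shrunk to 8 blocks with a stand-in mixing function): phase one fills a table sequentially, phase two reads it back in a data-dependent order, so each access depends on the previous result and the pattern defeats caches.

```c
#include <stdint.h>

#define TOY_N 8  /* real scrypt uses e.g. N = 16384; tiny for illustration */

static uint32_t toy_h(uint32_t x) {            /* stand-in for BlockMix */
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;   /* xorshift32 */
    return x;
}

/* ROMix shape: phase 1 fills TOY_N blocks sequentially, phase 2 reads
 * them back at data-dependent (effectively random) indices.  Each
 * phase-2 step depends on the previous value, so it cannot be
 * parallelized, and the unpredictable indices defeat caching. */
uint32_t toy_romix(uint32_t x) {
    uint32_t v[TOY_N];
    for (int i = 0; i < TOY_N; i++) { v[i] = x; x = toy_h(x); }
    for (int i = 0; i < TOY_N; i++) { x = toy_h(x ^ v[x % TOY_N]); }
    return x;
}
```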


Quote
SHA256 hashing only needs 1 thing: lots and lots of ALUs. Stuff like AVX, threads, and x86 just introduce structural dependencies that slow down the process.

No, it needs parallelism. Lots of ALUs won't help unless you can feed them with enough work. SHA256 round steps are sequential and there are dependencies, so you won't perform one SHA256 operation X times faster just by throwing in X times more ALUs. You would benefit a _lot_ more from running several independent SHA256 operations on those X ALUs in parallel.
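A toy C illustration of that last point (the "round" is a made-up stand-in, not SHA256): a single chain is a strict dependency line, but two independent chains interleaved in one loop give a wide core a ready instruction from one lane while the other lane's result is still in flight.

```c
#include <stdint.h>

static uint32_t mix(uint32_t s, uint32_t w) {   /* one dependent "round" */
    return ((s << 7) | (s >> 25)) + w;
}

/* One chain is strictly sequential, but two independent chains can be
 * interleaved so the hardware always has an issuable instruction:
 * while lane 0's add waits on its rotate, lane 1's rotate can issue. */
void two_lanes(uint32_t *s0, uint32_t *s1, const uint32_t *w, int rounds) {
    uint32_t a = *s0, b = *s1;
    for (int i = 0; i < rounds; i++) {
        a = mix(a, w[i]);   /* lane 0: depends only on previous a */
        b = mix(b, w[i]);   /* lane 1: independent of lane 0 */
    }
    *s0 = a; *s1 = b;
}
```

This is exactly what miners do: each workitem (or vector component) is an independent nonce, i.e. an independent SHA256 operation.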
7  Bitcoin / Mining speculation / Re: difficulty skyrocketing and price stagnant on: May 12, 2012, 12:06:24 PM
This is bullshit, why the hell isn't bitcoin following the basic laws of supply and demand?

What the hell is going on with the hashrate? It can't be FPGAs yet, and why the hell would people be throwing money at GPUs right now?

Just weird.

It is following them. But the thing is, miners have little to do with supply and nothing to do with demand.
8  Bitcoin / Mining / Re: GPU vs FPGA help me out.. on: April 03, 2012, 12:17:30 PM
It's more like hashcat can be reprogrammed to work with FPGAs, not the opposite.
9  Other / CPU/GPU Bitcoin mining hardware / Re: GTX 680 Perf Estimations on: March 24, 2012, 12:44:23 AM
Quote
A hypothetical GTX 580 SM core with 96 shaders@772MHz should still have the same performance as a real SM with 48 shaders@1544MHz, because it could only process 4 instructions per clock cycle (2 warps). A GTX 680 SMX core however is capable of doing 8 instructions per clock cycle (4 warps). It's the number of warp schedulers + shaders to fill that ultimately determines max performance.


Nope. I think you don't quite understand how a GPU functions. The best approximation I can give you is hyperthreading with 4 register sets rather than 2. So yes, if there is a memory fetch operation, there will be 3 warps in flight rather than 1, and yes, Kepler would be good for memory-intensive tasks (which might explain the AES case, if they blew up the lookup tables and they did not fit in __local memory).

But no, there is no 4x instruction throughput. In that respect, it's the same as Fermi. BTW a warp does not "execute" for just one clock cycle; it executes for much more, over 20 clock cycles, and there is of course a pipeline. With the sm_1x architecture, the pipeline was fucked up and new instructions were fetched/retired once per 4 clocks. Fermi improved that to once per 2 clocks. From what I read in the pdf, Kepler does exactly the same as Fermi. Now, the sm_21 archs introduced something like out-of-order execution where similar instructions on different independent data could be "batched". This in turn led to vectorizing OpenCL code for sm_21, and stuff like the GTX460 being ~60% faster when uint2 vectors are used. I am really wondering how far they got with that in GK104 :)

10  Other / CPU/GPU Bitcoin mining hardware / Re: Preliminary GTX 680 Test Results - Tom's Hardware Forwarded to Xtreme Systems on: March 21, 2012, 10:00:48 PM
Hello, this is GPGPU, not gaming. You have a common algorithm, not some binary blob that vendors create application profiles for, introducing all kinds of tricks to improve performance. No miracles and no marketing gibberish, just simple maths. You have 1536 shaders clocked at 1008 versus 2048 shaders clocked at 925. With GCN, both architectures are now scalar, and comparisons are even easier because bitcoin employs a simple algorithm that is extremely ALU-bound, not memory-intensive, and does not involve lots of branching. In the best case, where NVidia implemented bitwise rotations and bitselect, the performance difference would be ~22% in favor of the 7970. And this would also likely require a rewrite of the NVidia miners. As far as mining is concerned, and as long as the TDP and prices announced so far are correct, there is no way the 680 becomes a better alternative to the 7970. Not even close.
11  Other / CPU/GPU Bitcoin mining hardware / Re: GK104: nVidia's Kepler to be the First Mining Card? on: March 21, 2012, 05:48:42 PM
SHA1 is based on a Merkle-Damgård construction. It doesn't matter much whether you hash a lot of data once or a small amount of data many times: in both cases you are computing the same compression function over and over. Of course there are some differences, e.g. hashing a small amount of data can be more cache-friendly (on GPUs that could mean fewer __global reads), and there are quite a lot of optimizations possible when the input is fixed and small enough, as is the case with bitcoin. Anyway, I believe the ratios would be more or less the same, provided the graphs are correct.

However, this does not take into consideration whether the code is better optimized for a specific platform, the quality of the drivers, or the OpenCL stack (which does not perform as well as CUDA on NVidia).
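The compression-call counting behind the "ratios stay the same" argument can be made explicit with a small C helper (64-byte block, one 0x80 padding byte plus an 8-byte length field, as in SHA-1/SHA-256):

```c
#include <stddef.h>

/* Number of SHA-1/SHA-256 compression-function invocations needed for
 * a message of `len` bytes: padding appends 0x80 and an 8-byte length,
 * rounded up to the 64-byte block size.  Hashing 1 MB once costs about
 * the same number of calls as hashing 16 KB sixty-four times, which is
 * why the throughput ratios between cards carry over. */
size_t compression_calls(size_t len) {
    return (len + 1 + 8 + 63) / 64;
}
```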
12  Bitcoin / Mining software (miners) / Re: Why not utilize RAM in video cards? on: March 19, 2012, 02:28:42 PM
Because off-chip video memory is incredibly slow compared to registers and on-chip memory.
13  Bitcoin / Mining software (miners) / Re: DiaKGCN kernel for CGMINER + Phoenix 2 (79XX / 78XX / 77XX / GCN) - 2012-02-20 on: February 28, 2012, 02:54:31 AM
Great to see that.

Be careful with speed calculations though; you might be calculating the MH/s based on the presumption that you are doing 4 nonces per workitem, which is wrong.
14  Bitcoin / Mining software (miners) / Re: Phatk2 Mod (Already seeing improvement!) on: February 12, 2012, 10:27:03 PM
Nothing. Earlier termination would cost you more than going through all the checks.

You cannot use preprocessor directives for that because v and g are not known at compile time, so forget about #ifdefs, #elses, #defines and so on. I've seen such confusion from people who have mostly been coding in interpreted languages and recently switched to C.

Anyway. If I were to search for improvements in the kernel (assuming I changed the vector width to 8), perhaps the final checks are not the right place. If you have a look at the kernel, you'd notice that a lot of code has been "reordered" so that higher ALUPacking is achieved. For example, sometimes several w[X] values are calculated in a row, sometimes one with each SHA256 round step. Another thing is the order of operations in the macros; it is not random, and I bet whoever coded it profiled ALUPacking and chose the best case. Switching to uint8 would definitely break that. I believe you can get at least 1-2% performance improvement from tighter ALUPacking, which is much more than what you'd get from saving several ALU ops in the final checks :)
15  Bitcoin / Mining software (miners) / Re: Phatk2 Mod (Already seeing improvement!) on: February 12, 2012, 01:32:10 PM
No, you can't do that with predication.
16  Bitcoin / Mining software (miners) / Re: Phatk2 Mod (Already seeing improvement!) on: February 12, 2012, 10:22:54 AM
prefetch is a no-op on GPUs. It is useful only in CPU kernels, to prefetch data into the CPU cache (same as what _mm_prefetch() does).
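For the CPU side, a hedged C sketch using __builtin_prefetch, the GCC/Clang portable counterpart of _mm_prefetch() (the look-ahead distance of 16 is an arbitrary illustrative choice):

```c
#include <stddef.h>

/* On CPUs a software prefetch can pull a future element into cache
 * while the current one is being processed; on GPUs the same hint
 * compiles to nothing.  __builtin_prefetch(addr, rw, locality) is the
 * GCC/Clang builtin behind hints like _mm_prefetch() on x86. */
long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 1 /* low locality */);
        s += a[i];
    }
    return s;
}
```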

Quote
if defined VECTORS4
(v.s0==g.s0) ? uint nonce = (W[3].s0); #endif : ();
(v.s1==g.s1) ? uint nonce = (W[3].s1); #endif : ();
(v.s2==g.s2) ? uint nonce = (W[3].s2); #endif : ();
(v.s3==g.s3) ? uint nonce = (W[3].s3); #endif : ();
...
#endif

This is also not possible; it's an illegal construction that would fail compilation. (v.s0==g.s0) is evaluated at run time and the result is unknown to the preprocessor. If you need to terminate execution before the write, you can just do this:

if (!nonce) return;

I am not sure it would make much of a difference though.

17  Bitcoin / Mining software (miners) / Re: Phatk2 GOFFSET Mod on: February 10, 2012, 02:36:24 PM
I am kinda surprised that predication worked better than select(), usually it's just the opposite. Perhaps if you can send me both ISA dumps I can see what can be done to further improve that.

For the second part:

Quote
  u v = W[117] + W[108] + Vals[3] + Vals[7] + P2(124) + P1(124) + Ch((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123),Vals[1],Vals[2]);
   u g = -(K[60] + H[7]) - S1((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123));

Can we simplify these since they both contain (Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123)) ?  It would certainly reduce calculations a bit.  The only problem I see is Vals[1] and Vals[2] is inside of the parenthesis.  Now, I'm not familiar with the comma symbolization here, but if the parenthesis can be put on the inside next to the ch(123), it's as easy as dividing by ((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123)) to remove it and make the math simpler for the GPU.

I don't think it's worth trying.


P.S. I don't think ALU ops is a good performance metric. Of course it's important, but there are other factors: GPR usage and the number of clauses also matter a lot, so you have to profile the overall result. I've seen many situations where you have two kernels, one with a bit fewer ALU ops, the other with just one more clause, and the second one behaves much worse. Similarly with GPR usage. I am currently working on a RAR password-cracking kernel and it poses some fucking paradoxes. I have several kernels: one keeps everything in __private memory with large GPR usage, another shifts some of it to __local memory, and a third keeps a small lookup table in __global memory. The paradox is that the first one is the slowest; GPR usage is ~90 and performance is disgusting. The one that keeps part of the data in __local memory behaves much better, 36 GPRs used, much better occupancy, but performance still not what I expected. The kernel that uses an intermediate __global memory buffer is currently the fastest, mostly because of the cached global memory with SDK 2.6. It's twice as fast as the second one and several times faster than the first. I would never have expected that.


18  Bitcoin / Mining software (miners) / Re: Phatk2 GOFFSET Mod on: February 10, 2012, 10:48:57 AM
Yes, this is without any branching (similar to alternative 2 from my previous post), except that I had W[3] wrong.

Basically, the best approach would be to profile both and choose the faster one. With branching and without divergence, you get an additional clause (with divergence the penalty is worse, as both "paths" are serialized). Without branching, however, you introduce 7 dependent additions (they can't be packed into two VLIW bundles, since each addition depends on the result of the previous one). I am not sure which would be faster.

BTW for the scalar case, you don't need that:

Code:
   #else
            v = select(W[3],(u)0,(v==g));
            uint nonce = (v);

as direct comparison might be faster, especially with predication. E.g:

Code:
nonce = (v==g) ? W[3] : 0;

Unfortunately, this is not useful in the vector case. Of course you could try:

Code:
nonce = (v.s0==g.s0) ? W[3].s0 : nonce;
nonce = (v.s1==g.s1) ? W[3].s1 : nonce;
...

But that would generate much less efficient code than using select().



Quote
So, will having partial matches in a vector cause for any problems?

The only problem is when more than one component pair matches (v.sX and g.sX), for example v.s0==g.s0 and v.s3==g.s3. The version with branches would still end up with one of the two nonces written correctly to the output buffer (namely W[3].s3); the version with select() would write a wrong nonce to the output buffer (W[3].s0+W[3].s3).
19  Bitcoin / Mining software (miners) / Re: Phatk2 GOFFSET Mod on: February 10, 2012, 08:59:44 AM
I am compiling that with clcc and it builds successfully. It should work when you have VECTORS8 defined because you have:

Code:
#ifdef VECTORS8 
typedef uint8 u;

and eq is defined to be of type u.

Quote
So it takes that match and makes it nonce.  However, it continues through the if statements as though it were looking for another match.  So, either A) nonce needs to be increased in size to hold multiple equivalent vectors or B) the if statements needs to be stopped once a suitable nonce is found otherwise it will only serve to overwrite the first.

Yes, that's a valid point, but having nonce as a vector means you would also have to grow the output buffer (vector width) times, which in turn means (vector width) times larger device-host transfers. People with underclocked GPU memory and PCIe extenders won't be very happy about that :)

Quote
B) the if statements needs to be stopped once a suitable nonce is found otherwise it will only serve to overwrite the first.

Yep, that's the purpose of replacing branches with select()

Quote
And I think you may be misusing select there. You see, we need to pull apart v and g into their separate parts before figuring out which parts are equal. That is, unless we can xor v and g, pull apart v, and then write any vector = 0 from the equivalent g vector.

No, you don't need to do that. In OpenCL, a comparison between vectors like (v==g) itself yields a vector: each component is -1 (all bits set) if the corresponding v and g components are equal, and 0 otherwise. E.g. you have:

v = (uint8)(5,5,5,5,5,5,5,5);
g = (uint8)(1,2,3,4,5,6,7,8);

(v==g) would be (0,0,0,0,-1,0,0,0)

This is still not useful on its own, as nonce is a scalar value. Then also (I noticed that later and corrected it) nonce should equal the matching vector element from W[3], not v or g.

Thus, this is the most straightforward solution:

eq = select((u)0, W[3], (v==g));

What's the idea? eq is a vector, same width as W[3], v and g. For vector arguments, select(a, b, c) picks b[i] where the MSB of c[i] is set and a[i] otherwise, so eq keeps each W[3] component where v and g match and is 0 everywhere else.


Let W[3] contain (0x10,0x20,0x30,0x40,0x50,0x60,0x70,0x80)

eq would contain (0,0,0,0,0x50,0,0,0)

Since we need a scalar nonce, we just sum all the elements of eq and get 0+0+0+0+0x50+0+0+0 = 0x50.


Of course, this breaks if more than one v/g component pair matches; in that case the nonce would be wrong. The probability of this is low, but it can happen, and that's the worst case. Overall though, I think the performance improvement from eliminating the branches is worth the increased percentage of wrong shares. Also, a quick check on the host could prevent the miner from submitting the wrong share. And as I said, this should occur rarely.
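The whole trick, including the multi-match failure mode, can be emulated in plain scalar C (this mirrors the select()-and-sum logic above; array indices stand in for the .s0..s7 components):

```c
#include <stdint.h>

/* Scalar emulation of the OpenCL vector trick: zero out the
 * non-matching components of W[3] and sum what is left.  With exactly
 * one match the sum IS the matching nonce; with two matches the two
 * candidates add together and the result is garbage, which is the
 * failure mode described in the post. */
uint32_t select_sum_nonce(const uint32_t v[8], const uint32_t g[8],
                          const uint32_t w3[8]) {
    uint32_t nonce = 0;
    for (int i = 0; i < 8; i++)
        nonce += (v[i] == g[i]) ? w3[i] : 0;
    return nonce;
}
```

With the example values from the post (match at component 4), the sum is 0x50; force a second match and the result is the sum of both candidates, i.e. a wrong share that the host-side check would have to catch.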



20  Bitcoin / Mining software (miners) / Re: Phatk2 GOFFSET Mod on: February 09, 2012, 01:32:53 PM
OK, try this way:

Code:
#ifdef VECTORS8
        if (any(v==g))
        {
              u eq = select((u)0, W[3], (v==g));
              nonce = (eq.s0+eq.s1+eq.s2+eq.s3+eq.s4+eq.s5+eq.s6+eq.s7);
        }
#elif defined VECTORS4