Come-from-Beyond (OP)
Legendary
Offline
Activity: 2142
Merit: 1010
|
|
December 25, 2011, 08:23:29 PM |
|
I tested my miner for LTC an hour ago. The main task was to adjust the ratio of hashes per GHz. I lowered it to 1.87 (1870 hashes per second per 1 GHz), and it was 1.8-1.9 on most machines. But one computer had a 2.1 ratio, and I noticed no difference in hardware compared to the other machines. That puzzled me. Could you look at this screenshot and tell me what is special about this machine? I have no idea.
|
|
|
|
0ni0ns
Newbie
Offline
Activity: 40
Merit: 0
|
|
December 25, 2011, 10:14:00 PM |
|
Instruction set, pipelining, and instructions per cycle are just as important as GHz when comparing CPUs... What were the other ones? The i5 probably takes fewer cycles per hash.
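To make the "cycles per hash" point concrete: the hashes-per-GHz ratio from the post above converts directly into cycles per hash, since the clock frequency cancels out. A standalone sketch (the helper name is made up for illustration; the two rates are the ones reported above):

Code:
#include <stdio.h>

/* Cycles per hash = (cycles per second) / (hashes per second).
 * With the rate given in hashes per second per GHz, the clock
 * frequency cancels and only the ratio matters. */
static double cycles_per_hash(double hashes_per_sec_per_ghz)
{
    return 1e9 / hashes_per_sec_per_ghz;
}

int main(void)
{
    printf("1870 h/s per GHz -> ~%.0f cycles/hash\n", cycles_per_hash(1870.0));
    printf("2100 h/s per GHz -> ~%.0f cycles/hash\n", cycles_per_hash(2100.0));
    return 0;
}

So the odd machine at a 2.1 ratio spends roughly 476k cycles per hash versus roughly 535k on the others.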
|
|
|
|
ssvb
Newbie
Offline
Activity: 39
Merit: 0
|
|
December 25, 2011, 10:51:43 PM |
|
>> I tested my miner for LTC an hour ago. The main task was to adjust the ratio of hashes per GHz. I lowered it to 1.87 (1870 hashes per second per 1 GHz), and it was 1.8-1.9 on most machines. But one computer had a 2.1 ratio, and I noticed no difference in hardware compared to the other machines. That puzzled me. Could you look at this screenshot and tell me what is special about this machine? I have no idea.

It could also be the effect of turbo boost. Can you share any additional details about your implementation? Are you calculating more than one hash in parallel? If so, it is important for the working set not to exceed the L2 cache size per CPU core. Right now, with my own tweaks coincidentally developed just today (using intrinsics for prototyping purposes for now), I'm getting ~3.4 khash/s when calculating one hash at a time and ~5.1 khash/s when calculating two hashes at once, running a single thread on an Intel Core i7 860 @ 2.8 GHz (turbo boost disabled). Pooler's hand-tuned assembly code runs at ~4.0 khash/s under the same conditions on my machine. But if hyperthreading comes into play, it ruins everything: with 8 threads in total, both my and pooler's implementations get roughly the same ~3 khash/s per thread. Two hashes at once use double the working set (~256K of memory), and that is exactly the size of the L2 cache. There is no room for a second hardware thread on the same core, and hyperthreading is apparently thrashing the L2 cache, killing the performance benefits. Converting the code to hand-tuned assembly may turn the tables, though. BTW, your Sandy Bridge CPU should also have an advantage per MHz over my Nehalem, because it does not suffer from the register read stalls which are a pest under register pressure.
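To put numbers on the working-set argument: a standalone sanity check, assuming scrypt's scratchpad formula of 128 * r * N bytes with Litecoin's N = 1024, r = 1, p = 1 parameters:

Code:
#include <stdio.h>

int main(void)
{
    const unsigned N = 1024, r = 1;          /* Litecoin scrypt parameters */
    unsigned scratchpad = 128 * r * N;       /* bytes per hash             */
    unsigned l2_per_core = 256 * 1024;       /* Nehalem/Sandy Bridge L2    */

    printf("one hash:   %u KiB\n", scratchpad / 1024);
    printf("two hashes: %u KiB (exactly the %u KiB of L2 per core)\n",
           2 * scratchpad / 1024, l2_per_core / 1024);
    return 0;
}

One hash needs 128 KiB, two need 256 KiB, which is why a second hyperthread on the same core has nowhere to go except through L2 evictions.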
|
|
|
|
Come-from-Beyond (OP)
Legendary
Offline
Activity: 2142
Merit: 1010
|
|
December 26, 2011, 05:31:29 AM |
|
>> Instruction set, pipelining, and instructions per cycle are just as important as GHz when comparing CPUs... What were the other ones? The i5 probably takes fewer cycles per hash.

I recall there was no i5, only this one...
|
|
|
|
Come-from-Beyond (OP)
Legendary
Offline
Activity: 2142
Merit: 1010
|
|
December 26, 2011, 05:37:57 AM |
|
>> It could also be the effect of turbo boost. Can you share any additional details about your implementation? Are you calculating more than one hash in parallel? If so, it is important for the working set not to exceed the L2 cache size per CPU core. Right now, with my own tweaks coincidentally developed just today (using intrinsics for prototyping purposes for now), I'm getting ~3.4 khash/s when calculating one hash at a time and ~5.1 khash/s when calculating two hashes at once, running a single thread on an Intel Core i7 860 @ 2.8 GHz (turbo boost disabled). Pooler's hand-tuned assembly code runs at ~4.0 khash/s under the same conditions on my machine. But if hyperthreading comes into play, it ruins everything: with 8 threads in total, both my and pooler's implementations get roughly the same ~3 khash/s per thread. Two hashes at once use double the working set (~256K of memory), and that is exactly the size of the L2 cache. There is no room for a second hardware thread on the same core, and hyperthreading is apparently thrashing the L2 cache, killing the performance benefits. Converting the code to hand-tuned assembly may turn the tables, though. BTW, your Sandy Bridge CPU should also have an advantage per MHz over my Nehalem, because it does not suffer from the register read stalls which are a pest under register pressure.

I calculate 2 hashes at once to avoid register/memory read stalls (xmm0-xmm3 for the 1st hash and xmm4-xmm7 for the 2nd one). So there was something else that increased performance. After you mentioned the cache, I noticed that the computer from the picture had a very big cache size, 4 x 32 instead of 2 x 32 like on the other machines. I suspect that cache size is what matters more.
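For illustration, here is a minimal sketch of that interleaving pattern using SSE2 intrinsics. It is not the actual miner code, just the scheduling idea: each Salsa-style add/rotate/xor step is issued for both hashes back to back, so the two independent dependency chains hide each other's latency. All names here are invented:

Code:
#include <emmintrin.h>

/* Rotate each 32-bit lane left by n bits (SSE2 has no vector rotate). */
#define ROTL32(x, n) _mm_or_si128(_mm_slli_epi32((x), (n)), \
                                  _mm_srli_epi32((x), 32 - (n)))

/* One Salsa20-style step (d ^= ROTL(a + b, 7)), interleaved for two
 * independent hash states: the CPU can overlap the add/rotate/xor
 * latencies because the A and B chains share no registers. */
static void salsa_step_x2(__m128i a0, __m128i b0, __m128i *d0,
                          __m128i a1, __m128i b1, __m128i *d1)
{
    __m128i t0 = _mm_add_epi32(a0, b0);        /* hash A: add     */
    __m128i t1 = _mm_add_epi32(a1, b1);        /* hash B: add     */
    *d0 = _mm_xor_si128(*d0, ROTL32(t0, 7));   /* hash A: rot+xor */
    *d1 = _mm_xor_si128(*d1, ROTL32(t1, 7));   /* hash B: rot+xor */
}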
|
|
|
|
ssvb
Newbie
Offline
Activity: 39
Merit: 0
|
|
December 26, 2011, 07:54:52 AM |
|
>> I calculate 2 hashes at once to avoid register/memory read stalls (xmm0-xmm3 for the 1st hash and xmm4-xmm7 for the 2nd one).

OK, this explains why we get similar performance improvements (using pooler's code as a baseline for comparison). Do you have any other tricks up your sleeve? Of course, unless you want to keep them secret. Anyway, I got the impression that you managed to somehow simplify the algorithm (for the N = 1024, r = 1, p = 1 configuration) and reduce the number of arithmetic operations, based on the information from this post.

>> So there was something else that increased performance.

My guess is that it's because Sandy Bridge is a nice microarchitecture improvement over previous processors (it eliminated register read stalls and instruction decoder bottlenecks). Agner Fog explains it quite well. Various benchmarkers/reviewers have also confirmed better performance per MHz.

>> After you mentioned the cache, I noticed that the computer from the picture had a very big cache size, 4 x 32 instead of 2 x 32 like on the other machines. I suspect that cache size is what matters more.

I think 4 x 32 just means four cores with 32K of L1 each. The other processor was likely a dual-core. 32K is much smaller than the 128K needed for hash calculation, which means the L2 cache plays the more significant role. Because each core has 256K of L2 cache, computing two hashes in parallel makes a lot of sense, at least on processors without hyperthreading. What are your plans regarding your miner?
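If one wanted to pick the batch size at run time instead of hardcoding it, something like this could work. A hedged sketch: _SC_LEVEL2_CACHE_SIZE is a glibc extension, and other platforms would need CPUID instead:

Code:
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long l2 = sysconf(_SC_LEVEL2_CACHE_SIZE); /* glibc-specific; may be 0/-1 */
    long scratchpad = 128L * 1 * 1024;        /* 128 KiB per scrypt hash     */

    /* compute two hashes per thread only if both scratchpads fit in L2 */
    int batch = (l2 >= 2 * scratchpad) ? 2 : 1;
    printf("L2 = %ld KiB -> %d hash(es) per thread\n", l2 / 1024, batch);
    return 0;
}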
|
|
|
|
Come-from-Beyond (OP)
Legendary
Offline
Activity: 2142
Merit: 1010
|
|
December 26, 2011, 08:13:45 AM |
|
>> OK, this explains why we get similar performance improvements (using pooler's code as a baseline for comparison). Do you have any other tricks up your sleeve? Of course, unless you want to keep them secret. Anyway, I got the impression that you managed to somehow simplify the algorithm (for the N = 1024, r = 1, p = 1 configuration) and reduce the number of arithmetic operations, based on the information from this post.
>>
>> I think 4 x 32 just means four cores with 32K of L1 each. The other processor was likely a dual-core. 32K is much smaller than the 128K needed for hash calculation, which means the L2 cache plays the more significant role. Because each core has 256K of L2 cache, computing two hashes in parallel makes a lot of sense, at least on processors without hyperthreading. What are your plans regarding your miner?

Yes, I removed some redundant calculations from salsa (not all of them though; I use them to tweak the ratio per GHz). It's the only trick I use; all the others are just standard optimization techniques. I'd like to publish my miner, but I want it bound to a particular pool, so now I'm looking for a pool owner who could cooperate with me. Do you know anyone, btw? Also, I'm not going to create Linux or MacOS editions of the miner, to avoid a situation where all miners (human) use only my miner (soft).
|
|
|
|
ssvb
Newbie
Offline
Activity: 39
Merit: 0
|
|
December 26, 2011, 02:29:05 PM |
|
>> Yes, I removed some redundant calculations from salsa (not all of them though; I use them to tweak the ratio per GHz). It's the only trick I use; all the others are just standard optimization techniques.

I'm not sure if I fully understand this part. What does "tweak the ratio per GHz" mean?

>> I'd like to publish my miner, but I want it bound to a particular pool, so now I'm looking for a pool owner who could cooperate with me. Do you know anyone, btw? Also, I'm not going to create Linux or MacOS editions of the miner, to avoid a situation where all miners (human) use only my miner (soft).

Have you noticed that pooler has also updated his miner and introduced processing of two hashes at once? Actually, his commit on GitHub seems to be 2 days old already. Does your miner have any speed advantage over it? After all, your benchmark numbers look pretty normal for "two hashes at once" processing. How much does the algorithmic optimization actually contribute?
|
|
|
|
Come-from-Beyond (OP)
Legendary
Offline
Activity: 2142
Merit: 1010
|
|
December 26, 2011, 04:03:16 PM |
|
>> I'm not sure if I fully understand this part. What does "tweak the ratio per GHz" mean?
When I said "tweak" I meant "lowering". Yes, I lower the mining rate, because there is no need to publish a very fast miner. It's enough for me if it's just 10% faster than any other miner.
>> Have you noticed that pooler has also updated his miner and introduced processing of two hashes at once?
No, I didn't know that. If he didn't follow my advice to calculate 2 hashes in a manner where the instructions are duplicated and interleaved to avoid CPU stalls, then he still has a trick left to use.
>> Does your miner have any speed advantage over it?
The current version calculates 2000 hashes per second per 1 GHz. If pooler's miner has the same ratio, then there is no speed advantage to using my miner. I can't test it myself because the Windows version of pooler's miner crashes on my computer.
>> After all, your benchmark numbers look pretty normal for "two hashes at once" processing.
Yes, seems so.
>> How much does the algorithmic optimization actually contribute?
Not much; when I remove all the unnecessary code I get 2500 hashes per second per 1 GHz (one way such deliberate throttling could work is sketched below).
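How exactly the rate is capped isn't shown; Come-from-Beyond says he simply keeps some of salsa's redundant operations. A hypothetical equivalent pads each hash with calibrated dummy work. The numbers follow from the ratios above: 2500 h/s per GHz is 400k cycles per hash and 2000 is 500k, so roughly 100k padding cycles. Everything in this sketch is invented:

Code:
#include <stdint.h>

static volatile uint32_t sink; /* keeps the compiler from deleting the loop */

/* Hypothetical throttle: burn roughly 'cycles' clock cycles per hash so a
 * ~2500 h/s-per-GHz kernel ships at ~2000 h/s per GHz (~100k pad cycles). */
static void throttle_pad(uint32_t cycles)
{
    uint32_t x = 0x9e3779b9u;
    for (uint32_t i = 0; i < cycles / 4; i++)  /* ~4 cycles per iteration */
        x = (x << 5) ^ (x >> 3) ^ i;
    sink = x;
}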
|
|
|
|
ineededausername
|
|
December 27, 2011, 11:49:12 PM |
|
>> When I said "tweak" I meant "lowering". Yes, I lower the mining rate, because there is no need to publish a very fast miner. It's enough for me if it's just 10% faster than any other miner.

Why would you lower it when you can make it better, and when you're going to give it out for free anyway? That seems really counterintuitive.
|
(BFL)^2 < 0
|
|
|
Graet
VIP
Legendary
Offline
Activity: 980
Merit: 1001
|
|
December 28, 2011, 09:26:55 AM |
|
>> I'd like to publish my miner, but I want it bound to a particular pool, so now I'm looking for a pool owner who could cooperate with me. Do you know anyone, btw? Also, I'm not going to create Linux or MacOS editions of the miner, to avoid a situation where all miners (human) use only my miner (soft).

Why "bind it to a pool" and not allow it to be used by everyone? All of the pools have threads on this forum; you could post there or private-message a pool owner (they don't hide). In my experience, even when there is an excellent miner (soft), many users will (for whatever reason) continue to use the one they have, or other miners they think work better for them. If your software is truly revolutionary, someone will port it, even if you don't.
|
|
|
|
Come-from-Beyond (OP)
Legendary
Offline
Activity: 2142
Merit: 1010
|
|
December 28, 2011, 03:35:00 PM |
|
>> >> When I said "tweak" I meant "lowering". Yes, I lower the mining rate, because there is no need to publish a very fast miner. It's enough for me if it's just 10% faster than any other miner.
>>
>> Why would you lower it when you can make it better, and when you're going to give it out for free anyway? That seems really counterintuitive.

If everyone uses a fast miner, it is the same as if everyone uses a slow one, because the DIFFICULTY adjusts. So there is no point in giving it to everyone.
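The difficulty argument in numbers: each miner's expected reward share is their fraction of the total network hashrate, so a speedup applied to everyone cancels out. A toy calculation with made-up hashrates:

Code:
#include <stdio.h>

int main(void)
{
    double mine = 5.0, network = 1000.0;   /* khash/s, made-up numbers */
    double before = mine / network;

    /* everyone, including me, switches to a 25% faster miner */
    double after = (mine * 1.25) / (network * 1.25);

    printf("share before: %.4f, after: %.4f\n", before, after); /* equal */
    return 0;
}

Only a miner that is faster than what everyone else runs changes your share, which is exactly the "10% faster than any other miner" reasoning above.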
|
|
|
|
Come-from-Beyond (OP)
Legendary
Offline
Activity: 2142
Merit: 1010
Newbie
|
|
December 28, 2011, 03:37:56 PM |
|
why "bond to a pool" and not allow to be used by all
Just to earn some money for beer.
|
|
|
|
bitlane
Internet detective
Sr. Member
Offline
Activity: 462
Merit: 250
I heart thebaron
|
|
January 19, 2012, 07:26:42 PM |
|
>> If everyone uses a fast miner, it is the same as if everyone uses a slow one, because the DIFFICULTY adjusts. So there is no point in giving it to everyone.

Why not offer it up for sale, then? Require a key or some sort of license validation and make some cash from it. There is a pretty fancy BTC mining utility on this forum that is the flashiest I have ever seen and looks awesome, but it is also tied to a pool, and unfortunately that pool is not as popular as I am sure the pool owner would like it to be... regardless of his awesome-performing/looking software. The 'pool binding' idea is bad... just sell it.
|
|
|
|
Come-from-Beyond (OP)
Legendary
Offline
Activity: 2142
Merit: 1010
|
|
January 20, 2012, 08:05:26 PM |
|
Thanks, but I have switched to another project.
|
|
|
|