peacefulmind
|
 |
April 22, 2013, 05:34:53 AM Last edit: April 22, 2013, 06:33:43 AM by peacefulmind |
|
Christian,
Success,
I copied the settings from above, but I seem to get only 260 kH/s per TITAN.
|
"I think you are to hung up on this notion about 'pre-mining' being a No-No." - from journeys into the dark depths of the alt coin forum....
|
|
|
|
peacefulmind
|
 |
April 22, 2013, 07:54:06 AM |
|
A 570 though would be significantly faster (but also run significantly hotter). I am still trying to understand why the Kepler architecture has such a performance disadvantage with my current code. I did try some inline PTX assembly (looks horrid, check it out):

__device__ void ROTL7(uint32_t &A0, const uint32_t &A1, const uint32_t &A2,
                      uint32_t &B0, const uint32_t &B1, const uint32_t &B2,
                      uint32_t &C0, const uint32_t &C1, const uint32_t &C2,
                      uint32_t &D0, const uint32_t &D1, const uint32_t &D2)
{
    asm("{\n\t"
        " .reg .u32 tA1, tA2;\n\t"
        " .reg .u32 tB1, tB2;\n\t"
        " .reg .u32 tC1, tC2;\n\t"
        " .reg .u32 tD1, tD2;\n\t"
        // t?1 = ?1 + ?2
        " add.u32 tA1, %4, %5;\n\t"
        " add.u32 tB1, %6, %7;\n\t"
        " add.u32 tC1, %8, %9;\n\t"
        " add.u32 tD1, %10, %11;\n\t"
        // rotate left by 7: (t << 7) | (t >> 25)
        " shl.b32 tA2, tA1, 7;\n\t"
        " shl.b32 tB2, tB1, 7;\n\t"
        " shl.b32 tC2, tC1, 7;\n\t"
        " shl.b32 tD2, tD1, 7;\n\t"
        " shr.b32 tA1, tA1, 25;\n\t"
        " shr.b32 tB1, tB1, 25;\n\t"
        " shr.b32 tC1, tC1, 25;\n\t"
        " shr.b32 tD1, tD1, 25;\n\t"
        " or.b32 tA1, tA1, tA2;\n\t"
        " or.b32 tB1, tB1, tB2;\n\t"
        " or.b32 tC1, tC1, tC2;\n\t"
        " or.b32 tD1, tD1, tD2;\n\t"
        // ?0 ^= rotated sum
        " xor.b32 %0, %0, tA1;\n\t"
        " xor.b32 %1, %1, tB1;\n\t"
        " xor.b32 %2, %2, tC1;\n\t"
        " xor.b32 %3, %3, tD1;\n\t"
        "}"
        : "+r"(A0), "+r"(B0), "+r"(C0), "+r"(D0)
        : "r" (A1), "r" (A2), "r" (B1), "r" (B2),
          "r" (C1), "r" (C2), "r" (D1), "r" (D2));
}

as well as added instruction level parallelism by formulating the CUDA code like this:

#define ROTL7(A0, A1, A2, B0, B1, B2, C0, C1, C2, D0, D1, D2) \
{\
    volatile uint32_t tA1 = A1 + A2, tB1 = B1 + B2, tC1 = C1 + C2, tD1 = D1 + D2;\
    volatile uint32_t tA2 = tA1<< 7, tB2 = tB1<< 7, tC2 = tC1<< 7, tD2 = tD1<< 7;\
    tA1 = tA1>>25; tB1 = tB1>>25; tC1 = tC1>>25; tD1 = tD1>>25;\
    tA2|= tA1    ; tB2|= tB1    ; tC2|= tC1    ; tD2|= tD1    ;\
    A0 ^= tA2    ; B0 ^= tB2    ; C0 ^= tC2    ; D0 ^= tD2    ;\
}

but actually I couldn't get performance above what is already achieved. So in case you're wondering why there haven't been any updates: my experiments in getting more speed haven't been fruitful yet.

TITAN Update - Both Titans - 980 MHz core, 6286 MHz effective memory, 106% power target. Both together ~500-530 kH/s. Seems a Titan/K20 ought to do more in theory. Would love to see EACH at 500 kH/s! I really appreciate that this NVIDIA development may help increase LTC interest and adoption!
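For anyone who wants to check a rewritten ROTL7 against a known-good result, here is a minimal host-side sketch in plain C++ of the operation both snippets above compute per lane: x0 ^= rotl(x1 + x2, 7). The names rotl32, rotl7_step, rotl7_step_x4 and rotl7_self_check are made up for this illustration and are not part of cudaminer. The four-lane version only mirrors the stage-by-stage layout of the macro; whether that layout helps on a given GPU is a separate question.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 32-bit word left by n bits (valid for 0 < n < 32).
constexpr std::uint32_t rotl32(std::uint32_t x, unsigned n) {
    return (x << n) | (x >> (32u - n));
}

// One lane of the step: a0 ^= rotl(a1 + a2, 7). The addition wraps modulo 2^32.
inline void rotl7_step(std::uint32_t& a0, std::uint32_t a1, std::uint32_t a2) {
    a0 ^= rotl32(a1 + a2, 7);
}

// Four independent lanes, written stage by stage like the macro in the post:
// all adds first, then all left shifts, all right shifts, then or and xor.
inline void rotl7_step_x4(std::uint32_t (&x0)[4],
                          const std::uint32_t (&x1)[4],
                          const std::uint32_t (&x2)[4]) {
    std::uint32_t sum[4], hi[4], lo[4];
    for (int i = 0; i < 4; ++i) sum[i] = x1[i] + x2[i];
    for (int i = 0; i < 4; ++i) hi[i]  = sum[i] << 7;
    for (int i = 0; i < 4; ++i) lo[i]  = sum[i] >> 25;
    for (int i = 0; i < 4; ++i) x0[i] ^= hi[i] | lo[i];
}

// Hypothetical self-check: the staged four-lane form must agree with the
// single-lane reference for every lane.
inline void rotl7_self_check() {
    const std::uint32_t x1[4] = {0x80000000u, 0x12345678u, 0xFFFFFFFFu, 0u};
    const std::uint32_t x2[4] = {1u, 0x9ABCDEF0u, 1u, 0u};
    std::uint32_t staged[4] = {0xFFFFFFFFu, 0u, 0xDEADBEEFu, 7u};
    std::uint32_t ref[4]    = {0xFFFFFFFFu, 0u, 0xDEADBEEFu, 7u};
    rotl7_step_x4(staged, x1, x2);
    for (int i = 0; i < 4; ++i) {
        rotl7_step(ref[i], x1[i], x2[i]);
        assert(staged[i] == ref[i]);
    }
}
```

Calling rotl7_self_check() once from a test program is enough to exercise both forms. The same input vectors can be fed to a GPU kernel and compared against ref on the host.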
|
|
|
|
cbuchner1 (OP)
|
 |
April 22, 2013, 08:49:34 AM Last edit: April 22, 2013, 11:06:30 AM by cbuchner1 |
|
I've just run into the same compiler issue that borked the Titan kernels when I tried to compile salsa_kernel.cu for sm_30. The kernel will just crash.
Maybe using the NSight debugger I can figure out why this occurs.
|
|
|
|
cbuchner1 (OP)
|
 |
April 22, 2013, 11:07:46 AM |
|
I've seen reports of a single overclocked Titan doing 290 kHash/s, using a somewhat earlier code version.
|
|
|
|
cbuchner1 (OP)
|
 |
April 22, 2013, 11:10:28 AM Last edit: April 22, 2013, 11:43:39 AM by cbuchner1 |
|
Seen this. The challenges with scrypt hashing go beyond just using the funnel shifter for rotation. One issue is the speed and efficiency of memory access; the other is getting enough occupancy on Kepler's SMX (multiprocessor) units, where shared memory and register limits get in the way. This mainly affects the GTX 660 Ti, GTX 670, GTX 680 and Titan devices, which currently perform rather poorly in comparison to the 5xx series.
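To make the occupancy point concrete, here is a rough host-side sketch in C++ of how shared memory and register use cap the number of resident blocks on one SMX. The per-SMX limits are the ones NVIDIA documents for compute capability 3.x. The kernel footprint used in the check (128 threads, 63 registers per thread, 10 KB of shared memory per block) is a made-up placeholder, not a measurement of the cudaminer kernels, and the estimate ignores allocation granularity, so treat the result as an upper bound.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits for Kepler (compute capability 3.x).
struct SmxLimits {
    int max_threads      = 2048;   // resident threads per SMX
    int max_blocks       = 16;     // resident blocks per SMX
    int registers        = 65536;  // 32-bit registers per SMX
    int shared_mem_bytes = 49152;  // 48 KB shared memory configuration
};

// Hypothetical kernel footprint; fill in values reported by your own build.
struct KernelFootprint {
    int threads_per_block;
    int regs_per_thread;
    int shared_bytes_per_block;
};

// Blocks that fit on one SMX: the tightest of the four limits wins.
inline int resident_blocks(const SmxLimits& smx, const KernelFootprint& k) {
    assert(k.threads_per_block > 0 && k.regs_per_thread > 0);
    const int by_threads = smx.max_threads / k.threads_per_block;
    const int by_regs    = smx.registers / (k.regs_per_thread * k.threads_per_block);
    const int by_shared  = k.shared_bytes_per_block > 0
                         ? smx.shared_mem_bytes / k.shared_bytes_per_block
                         : smx.max_blocks;
    return std::min({smx.max_blocks, by_threads, by_regs, by_shared});
}

// Fraction of the SMX's thread slots that end up occupied.
inline double occupancy(const SmxLimits& smx, const KernelFootprint& k) {
    return double(resident_blocks(smx, k) * k.threads_per_block) / smx.max_threads;
}
```

With the placeholder footprint above, threads allow 16 blocks, registers allow 8 and shared memory allows only 4, so occupancy comes out at 4 * 128 / 2048 = 0.25. Shrinking either the register count or the per-block shared memory moves the limit to the next constraint.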
|
|
|
|
K1773R
Legendary
Offline
Activity: 1792
Merit: 1008
/dev/null
|
 |
April 22, 2013, 11:47:10 AM |
|
how much are you guys getting with a 580?
|
[GPG Public Key]BTC/DVC/TRC/FRC: 1 K1773RbXRZVRQSSXe9N6N2MUFERvrdu6y ANC/XPM A K1773RTmRKtvbKBCrUu95UQg5iegrqyeA NMC: N K1773Rzv8b4ugmCgX789PbjewA9fL9Dy1 LTC: L Ki773RBuPepQH8E6Zb1ponoCvgbU7hHmd EMC: E K1773RxUes1HX1YAGMZ1xVYBBRUCqfDoF BQC: b K1773R1APJz4yTgRkmdKQhjhiMyQpJgfN
|
|
|
SubNoize
Newbie
Offline
Activity: 47
Merit: 0
|
 |
April 22, 2013, 12:13:30 PM |
|
how much are you guys getting with a 580?
240KH/s give or take 10KH/s
|
|
|
|
K1773R
Legendary
Offline
Activity: 1792
Merit: 1008
/dev/null
|
 |
April 22, 2013, 12:34:44 PM |
|
how much are you guys getting with a 580?
240KH/s give or take 10KH/s
Sweet, I got ~257 (slightly OC). As soon as I've mined some coins I'll send a donation for sure.
|
|
|
|
Misiolap
Newbie
Offline
Activity: 14
Merit: 0
|
 |
April 22, 2013, 12:42:05 PM |
|
I've just run into the same compiler issue that borked the Titan kernels when I tried to compile salsa_kernel.cu for sm_30. The kernel will just crash.
Maybe using the NSight debugger I can figure out why this occurs.
Does the crash produce CUDA_EXCEPTION_6, Warp Misaligned Address? I've been able to compile and run salsa_kernel for sm_21, without tex-cache, when accesses to the X variable are 128-bit aligned, i.e. when it's declared like this:
__shared__ uint32_t X[WARPS_PER_BLOCK][WU_PER_WARP][16+4];
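The arithmetic behind that declaration can be checked on the host. The sketch below, in plain C++, shows that a row of 16+4 words is 80 bytes, a multiple of 16, so every row of X starts on a 128-bit boundary as long as the array itself does. The 16+1 row length is a hypothetical counter-example picked for illustration, and the values given to WARPS_PER_BLOCK and WU_PER_WARP are placeholders. The thread does not say what the previous declaration looked like or why the row is padded beyond 16 words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte stride of one row holding `words` 32-bit values.
constexpr std::size_t row_stride_bytes(std::size_t words) {
    return words * sizeof(std::uint32_t);
}

// True when consecutive rows stay on 16-byte (128-bit) boundaries,
// assuming the array base is itself 16-byte aligned.
constexpr bool rows_128bit_aligned(std::size_t words) {
    return row_stride_bytes(words) % 16 == 0;
}

static_assert(rows_128bit_aligned(16 + 4), "20 words = 80 bytes, a multiple of 16");
static_assert(!rows_128bit_aligned(16 + 1), "hypothetical 17-word row = 68 bytes");

// Host-side stand-in for the shared array; both sizes are placeholders.
constexpr int WARPS_PER_BLOCK = 3;
constexpr int WU_PER_WARP     = 32;
alignas(16) static std::uint32_t X[WARPS_PER_BLOCK][WU_PER_WARP][16 + 4];

// Checks the actual address of one row of the stand-in array.
inline bool row_is_aligned(int warp, int wu) {
    assert(warp >= 0 && warp < WARPS_PER_BLOCK && wu >= 0 && wu < WU_PER_WARP);
    return reinterpret_cast<std::uintptr_t>(&X[warp][wu][0]) % 16 == 0;
}
```

In general a row of 32-bit words keeps 128-bit alignment exactly when its length is a multiple of 4, which is why 16+4 works where an odd amount of padding would not.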
|
|
|
|
cbuchner1 (OP)
|
 |
April 22, 2013, 01:28:48 PM |
|
__shared__ uint32_t X[WARPS_PER_BLOCK][WU_PER_WARP][16+4];
Thanks! This helped. I did not know about the newly added alignment restrictions on shared memory when targeting SM 2.0 and higher. I guess that's because they now have a unified pointer and addressing scheme, so if there's an alignment requirement, it applies to everything. Finally the Titan kernel will get my large memory transaction fixes, which should boost performance notably.
Christian
|
|
|
|
Aggrophobia
Legendary
Offline
Activity: 1064
Merit: 1000
|
 |
April 22, 2013, 02:22:47 PM Last edit: April 22, 2013, 02:44:06 PM by Aggrophobia |
|
autoadjust does not find the best values for my Titan; I had to find the best values myself.
Edit: now that I checked the -D option, it works with 70x4 at 280 kHash/s.
|
|
|
|
cbuchner1 (OP)
|
 |
April 22, 2013, 02:50:06 PM |
|
autoadjust does not find the best values for my titan, had to find the best values  it works with 70x4 280khash/s
It's autotune (TM) (R). How's 35x8?
Christian
|
|
|
|
cbuchner1 (OP)
|
 |
April 22, 2013, 07:55:22 PM |
|
Posted an April 22nd release.
Please let me know how it compiles on Linux 64 bit, and how it performs on Titan now.
The patch posted earlier wasn't really doing things right. CUDA textures should have stayed ulong2 and ulong4 type, but the uint32_t type needed to be moved over to unsigned long (from unsigned int previously) because otherwise there would be a mismatch with the texture types.
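One thing worth checking when a 32-bit word type is tied to unsigned long: the width of long differs between 64-bit platforms. 64-bit Linux uses the LP64 data model (long is 64 bits), while 64-bit Windows uses LLP64 (long is 32 bits). The small C++ sketch below reports which case a given build falls into. The helper names bit_width, ulong_is_32bit and data_model are invented for this example, and whether this is actually what breaks the 64-bit Linux build is only a guess, not something the thread confirms.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Width in bits of a type on the platform this is compiled for.
template <typename T>
constexpr int bit_width() { return int(sizeof(T) * CHAR_BIT); }

// uint32_t is exactly 32 bits wherever it exists.
static_assert(bit_width<std::uint32_t>() == 32, "uint32_t is always 32 bits");

// True when unsigned long has the same width as a 32-bit word here.
constexpr bool ulong_is_32bit() { return bit_width<unsigned long>() == 32; }

// Names the data model based only on the sizes of long and pointers.
inline const char* data_model() {
    if (sizeof(void*) == 8 && sizeof(long) == 8) return "LP64";   // e.g. 64-bit Linux
    if (sizeof(void*) == 8 && sizeof(long) == 4) return "LLP64";  // e.g. 64-bit Windows
    if (sizeof(void*) == 4) return "ILP32";                       // 32-bit builds
    return "other";
}

// Hypothetical guard: call this where code assumes unsigned long is a 32-bit word.
inline void require_32bit_ulong() {
    assert(ulong_is_32bit() && "unsigned long is wider than 32 bits on this platform");
}
```

On an LP64 build ulong_is_32bit() is false, so any code that treats unsigned long as a 32-bit word would read and write twice as many bytes as intended.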
|
|
|
|
cbuchner1 (OP)
|
 |
April 22, 2013, 08:20:09 PM |
|
hmm, the patch posted earlier suggests the following configure line for 64 bits
./configure "CFLAGS=-O3" "CXXFLAGS=-O3" "LDFLAGS=-Wl,-O1" --with-cuda=/usr/local/cuda
not sure what the -Wl,-O1 linker flag is supposed to do.
|
|
|
|
Misiolap
Newbie
Offline
Activity: 14
Merit: 0
|
 |
April 22, 2013, 08:26:04 PM |
|
My mistake, it shouldn't be there - at the moment -O1 for ld only turns on some optimizations for shared libraries, not the program binary.
|
|
|
|
dbabo
Newbie
Offline
Activity: 41
Merit: 0
|
 |
April 22, 2013, 08:26:29 PM |
|
Posted an April 22nd release.
Please let me know how it compiles on Linux 64 bit, and how it performs on Titan now.
The patch posted earlier wasn't really doing things right. CUDA textures should have stayed ulong2 and ulong4 type, but the uint32_t type needed to be moved over to unsigned long (from unsigned int previously) because otherwise there would be a mismatch with the texture types.
Christian,
configure works fine if I run:
./configure -with-cuda=/usr/local/cuda
instead of ./configure.sh
And it almost compiles - http://pastebin.com/raw.php?i=JZb62Jtd
|
|
|
|
cbuchner1 (OP)
|
 |
April 22, 2013, 08:29:08 PM |
|
That doesn't qualify as almost! 
|
|
|
|
dbabo
Newbie
Offline
Activity: 41
Merit: 0
|
 |
April 22, 2013, 08:36:08 PM |
|
That doesn't qualify as almost!
Ha-ha, close enough. I think I observed the same errors before the patch, so it's (hopefully) something simple.
|
|
|
|
dbabo
Newbie
Offline
Activity: 41
Merit: 0
|
 |
April 22, 2013, 08:37:49 PM |
|
My mistake, it shouldn't be there - at the moment -O1 for ld only turns on some optimizations for shared libraries, not the program binary.
-O3 takes a whopping 5 kH/s out of my super fast GT460
|
|
|
|
cbuchner1 (OP)
|
 |
April 22, 2013, 08:41:31 PM |
|
I am like so close -----> <----- to throwing out the texture cache support in 64 bit builds.
|
|
|
|
|