Author Topic: [ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX]  (Read 3426868 times)
peacefulmind | Full Member | Activity: 196 | Merit: 100
April 22, 2013, 05:34:53 AM (last edit: April 22, 2013, 06:33:43 AM)
#441

Christian,

Success!

Copied the settings from above, but it seems to be only 260 kH/s per TITAN.

peacefulmind | Full Member | Activity: 196 | Merit: 100
April 22, 2013, 07:54:06 AM
#442

A 570, though, would be significantly faster (but also run significantly hotter). I am still trying to understand why the Kepler architecture is at such a performance disadvantage with my current code.

I did try some inline PTX assembly (it looks horrid, check it out):

Code:
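// One rotate-add-XOR step of the Salsa20/8 core used by scrypt, four lanes at
// a time: each lane computes A0 ^= ROTL32(A1 + A2, 7) (likewise for B, C, D),
// with the rotate spelled out as shl/shr/or in PTX.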
__device__ void ROTL7(uint32_t &A0, const uint32_t &A1, const uint32_t &A2,
                      uint32_t &B0, const uint32_t &B1, const uint32_t &B2,
                      uint32_t &C0, const uint32_t &C1, const uint32_t &C2,
                      uint32_t &D0, const uint32_t &D1, const uint32_t &D2)
{
    asm("{\n\t"
    "  .reg .u32 tA1, tA2;\n\t"
    "  .reg .u32 tB1, tB2;\n\t"
    "  .reg .u32 tC1, tC2;\n\t"
    "  .reg .u32 tD1, tD2;\n\t"
    "  add.u32 tA1, %4, %5;\n\t"
    "  add.u32 tB1, %6, %7;\n\t"
    "  add.u32 tC1, %8, %9;\n\t"
    "  add.u32 tD1, %10, %11;\n\t"
    "  shl.b32 tA2, tA1, 7;\n\t"
    "  shl.b32 tB2, tB1, 7;\n\t"
    "  shl.b32 tC2, tC1, 7;\n\t"
    "  shl.b32 tD2, tD1, 7;\n\t"
    "  shr.b32 tA1, tA1, 25;\n\t"
    "  shr.b32 tB1, tB1, 25;\n\t"
    "  shr.b32 tC1, tC1, 25;\n\t"
    "  shr.b32 tD1, tD1, 25;\n\t"
    "  or.b32 tA1, tA1, tA2;\n\t"
    "  or.b32 tB1, tB1, tB2;\n\t"
    "  or.b32 tC1, tC1, tC2;\n\t"
    "  or.b32 tD1, tD1, tD2;\n\t"
    "  xor.b32 %0, %0, tA1;\n\t"
    "  xor.b32 %1, %1, tB1;\n\t"
    "  xor.b32 %2, %2, tC1;\n\t"
    "  xor.b32 %3, %3, tD1;\n\t"
    "}"
    : "+r"(A0), "+r"(B0), "+r"(C0), "+r"(D0) : "r" (A1), "r" (A2), "r" (B1), "r" (B2), "r" (C1), "r" (C2), "r" (D1), "r" (D2));
}

as well as added instruction level parallelism by formulating the CUDA code like this:

Code:
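// Same rotate-add-XOR step as the PTX version above, in plain CUDA C, with the
// four independent lanes interleaved so the compiler can overlap their
// add/shift/or chains (instruction-level parallelism).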
#define ROTL7(A0, A1, A2, B0, B1, B2, C0, C1, C2, D0, D1, D2)  \
{\
    volatile uint32_t tA1 = A1 + A2, tB1 = B1 + B2, tC1 = C1 + C2, tD1 = D1 + D2;\
    volatile uint32_t tA2 = tA1<< 7, tB2 = tB1<< 7, tC2 = tC1<< 7, tD2 = tD1<< 7;\
                      tA1 = tA1>>25; tB1 = tB1>>25; tC1 = tC1>>25; tD1 = tD1>>25;\
                      tA2|= tA1    ; tB2|= tB1    ; tC2|= tC1    ; tD2|= tD1    ;\
                      A0 ^= tA2    ; B0 ^= tB2    ; C0 ^= tC2    ; D0 ^= tD2    ;\
}

but in practice I couldn't get performance above what was already achieved. So, in case you're wondering why there haven't been any updates: my experiments at getting more speed haven't been fruitful yet.



TITAN update:

Both Titans: 980 MHz core, 6286 MHz effective memory, 106% power target.

Both together: ~500-530 kH/s. It seems a Titan/K20 ought to do more in theory; I would love to see EACH at 500 kH/s!

I really appreciate that this NVIDIA development may help increase LTC interest and adoption!

cbuchner1 (OP) | Hero Member | Activity: 756 | Merit: 502
April 22, 2013, 08:49:34 AM (last edit: April 22, 2013, 11:06:30 AM)
#443

I've just run into the same compiler issue that borked the Titan kernels when I tried to compile salsa_kernel.cu for sm_30. The kernel will just crash.

Maybe using the Nsight debugger I can figure out why this occurs.



cbuchner1 (OP) | Hero Member | Activity: 756 | Merit: 502
April 22, 2013, 11:07:46 AM
#444


I've seen reports of a single overclocked Titan doing 290 kHash/s, using a somewhat earlier code version.


cbuchner1 (OP) | Hero Member | Activity: 756 | Merit: 502
April 22, 2013, 11:10:28 AM (last edit: April 22, 2013, 11:43:39 AM)
#445

I assume you've seen this Kepler thread?

https://bitcointalk.org/index.php?topic=163750.0;topicseen

Seen this.

The challenges with scrypt hashing go a bit beyond just using the funnel shifter for rotation. One issue is the speed and efficiency of memory access; the other is getting enough occupancy on Kepler's SMX (multiprocessor) units, where shared memory and register limits are a constraint. This mainly affects the GTX 660 Ti, GTX 670, GTX 680 and Titan devices, which currently perform rather poorly in comparison to the 5xx series.
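For reference, the funnel shifter mentioned above can express a 32-bit rotate directly on sm_35-class Kepler parts. A minimal sketch of such a rotate, not code from cudaMiner itself, with a plain shift/or fallback for older architectures:

Code:
#include <stdint.h>

// Sketch only: 32-bit rotate-left via the sm_35+ funnel shift intrinsic.
// __funnelshift_l(lo, hi, n) returns the upper 32 bits of the 64-bit value
// (hi:lo) shifted left by n, so passing x for both halves yields ROTL32(x, n).
__device__ __forceinline__ uint32_t rotl32(uint32_t x, uint32_t n)
{
#if __CUDA_ARCH__ >= 350
    return __funnelshift_l(x, x, n);
#else
    return (x << n) | (x >> (32 - n));   // n is 1..31 in these kernels (7 here)
#endif
}

As noted above, though, the rotate itself is only a small part of the problem; memory access and SMX occupancy dominate.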
K1773R | Legendary | Activity: 1792 | Merit: 1008
April 22, 2013, 11:47:10 AM
#446

how much are you guys getting with a 580?

SubNoize | Newbie | Activity: 47 | Merit: 0
April 22, 2013, 12:13:30 PM
#447

how much are you guys getting with a 580?

240 kH/s, give or take 10 kH/s.
K1773R | Legendary | Activity: 1792 | Merit: 1008
April 22, 2013, 12:34:44 PM
#448

how much are you guys getting with a 580?

240 kH/s, give or take 10 kH/s.

Sweet, I'm getting ~257 kH/s (slightly OC'd). As soon as I've mined some coins I'll send a donation for sure.

Misiolap | Newbie | Activity: 14 | Merit: 0
April 22, 2013, 12:42:05 PM
#449

I've just run into the same compiler issue that borked the Titan kernels when I tried to compile salsa_kernel.cu for sm_30. The kernel will just crash.

Maybe using the Nsight debugger I can figure out why this occurs.

Does the crash produce CUDA_EXCEPTION_6, Warp Misaligned Address?

I've been able to compile and run salsa_kernel for sm_21, without the texture cache, when accesses to the X variable are 128-bit aligned, i.e. when it's declared like this:
Code:
__shared__ uint32_t X[WARPS_PER_BLOCK][WU_PER_WARP][16+4];
cbuchner1 (OP) | Hero Member | Activity: 756 | Merit: 502
April 22, 2013, 01:28:48 PM
#450

__shared__ uint32_t X[WARPS_PER_BLOCK][WU_PER_WARP][16+4];

Thanks! This helped. I did not know about the newly added alignment restrictions for shared memory when targeting SM 2.0 and higher. I guess that's because there is now a unified pointer and addressing scheme, so if an alignment requirement exists, it applies to everything.

Finally, the Titan kernel will get my large memory transaction fixes, which should boost performance notably.

Christian
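To illustrate the alignment rule being discussed (a sketch only, not the actual salsa_kernel.cu code; WARPS_PER_BLOCK and WU_PER_WARP are given made-up values here): 128-bit vectorized accesses to shared memory must start on a 16-byte boundary, and a misaligned uint4 load is exactly what trips the CUDA_EXCEPTION_6 / warp misaligned address error mentioned above. The padded row of 16+4 = 20 words keeps every row stride a multiple of 16 bytes, so per-row uint4 loads stay legal:

Code:
#include <stdint.h>

#define WARPS_PER_BLOCK 4    // illustrative values, not cudaMiner's settings
#define WU_PER_WARP     32

// Launch as: aligned_shared_sketch<<<1, dim3(WU_PER_WARP, WARPS_PER_BLOCK)>>>(out);
__global__ void aligned_shared_sketch(uint4 *out)
{
    // 20 words per row -> 80-byte row stride (a multiple of 16 bytes); the
    // __align__(16) forces the base of the array onto a 16-byte boundary,
    // so &X[w][t][0] is 16-byte aligned for every (w, t).
    __shared__ __align__(16) uint32_t X[WARPS_PER_BLOCK][WU_PER_WARP][16 + 4];

    int w = threadIdx.y;
    int t = threadIdx.x;
    for (int i = 0; i < 16 + 4; ++i)
        X[w][t][i] = i;

    // Legal 128-bit load: the source address is 16-byte aligned.
    uint4 v = *reinterpret_cast<uint4 *>(&X[w][t][0]);

    // By contrast, a uint4 load from &X[w][t][1] would be only 4-byte aligned
    // and would trap with a misaligned-address error under the debugger.

    out[w * WU_PER_WARP + t] = v;
}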
Aggrophobia | Legendary | Activity: 1064 | Merit: 1000
April 22, 2013, 02:22:47 PM (last edit: April 22, 2013, 02:44:06 PM)
#451

Autoadjust does not find the best values for my Titan; I had to find them myself.

Edit: I have now checked the -D option; it works with 70x4 at 280 kHash/s.
cbuchner1 (OP) | Hero Member | Activity: 756 | Merit: 502
April 22, 2013, 02:50:06 PM
#452

Autoadjust does not find the best values for my Titan; I had to find them myself. It works with 70x4 at 280 kHash/s.

It's autotune (TM) (R).

How does 35x8 do?

Christian
cbuchner1 (OP) | Hero Member | Activity: 756 | Merit: 502
April 22, 2013, 07:55:22 PM
#453

Posted an April 22nd release.

Please let me know how it compiles on 64-bit Linux, and how it performs on Titan now.

The patch posted earlier wasn't really doing things right. The CUDA textures should have stayed ulong2 and ulong4, but the uint32_t type needed to be moved over to unsigned long (from unsigned int previously), because otherwise there would be a mismatch with the texture types.
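A rough illustration of the kind of type matching involved, using the legacy texture-reference API of that era (a sketch only, not the actual cudaMiner texture code; the names and sizes are made up): the element type declared for the texture has to match the layout of the buffer bound to it, otherwise the fetched words come back reinterpreted.

Code:
#include <stdint.h>

// Sketch only (legacy texture-reference API; deprecated and later removed in
// CUDA 12). A 1-D texture over linear memory with 64-bit uint2 elements.
texture<uint2, 1, cudaReadModeElementType> texScratch;

__global__ void fetch_sketch(uint2 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texScratch, i);   // fetch type matches the texture's element type
}

// Host side: the bound buffer must really contain uint2-shaped data.
//   uint2 *d_scratch;
//   cudaMalloc(&d_scratch, n * sizeof(uint2));
//   cudaBindTexture(NULL, texScratch, d_scratch, n * sizeof(uint2));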


cbuchner1 (OP) | Hero Member | Activity: 756 | Merit: 502
April 22, 2013, 08:20:09 PM
#454


Hmm, the patch posted earlier suggests the following configure line for 64-bit builds:

./configure "CFLAGS=-O3" "CXXFLAGS=-O3" "LDFLAGS=-Wl,-O1" --with-cuda=/usr/local/cuda

I'm not sure what the -Wl,-O1 linker flag is supposed to do.
Misiolap | Newbie | Activity: 14 | Merit: 0
April 22, 2013, 08:26:04 PM
#455

My mistake, it shouldn't be there - at the moment -O1 for ld only turns on some optimizations for shared libraries, not the program binary.
dbabo | Newbie | Activity: 41 | Merit: 0
April 22, 2013, 08:26:29 PM
#456

Posted an April 22nd release.

Please let me know how it compiles on 64-bit Linux, and how it performs on Titan now.

The patch posted earlier wasn't really doing things right. The CUDA textures should have stayed ulong2 and ulong4, but the uint32_t type needed to be moved over to unsigned long (from unsigned int previously), because otherwise there would be a mismatch with the texture types.




Christian,
configure works fine if I run:
./configure -with-cuda=/usr/local/cuda

instead of ./configure.sh

And it almost compiles - http://pastebin.com/raw.php?i=JZb62Jtd
cbuchner1 (OP) | Hero Member | Activity: 756 | Merit: 502
April 22, 2013, 08:29:08 PM
#457


That doesn't qualify as "almost"!

dbabo | Newbie | Activity: 41 | Merit: 0
April 22, 2013, 08:36:08 PM
#458


That doesn't qualify as "almost"!

Ha-ha, close enough. I think I observed the same errors before the patch, so it's (hopefully) something simple.
dbabo | Newbie | Activity: 41 | Merit: 0
April 22, 2013, 08:37:49 PM
#459

My mistake, it shouldn't be there - at the moment -O1 for ld only turns on some optimizations for shared libraries, not the program binary.

-O3 takes a whopping 5 kH/s out of my super-fast GT460.
cbuchner1 (OP) | Hero Member | Activity: 756 | Merit: 502
April 22, 2013, 08:41:31 PM
#460


I am this close -----> <----- to throwing out the texture cache support in 64-bit builds.

