Bitcoin Forum
April 16, 2014, 08:44:24 AM *
News: ♦♦ A bug in OpenSSL, used by Bitcoin-Qt/Bitcoin Core, could allow your bitcoins to be stolen. Immediately updating Bitcoin Core to 0.9.1 is required in some cases, especially if you're using 0.9.0. Download. More info.
The same bug also affected the forum. Changing your forum password is recommended.
 
   Home   Help Search Donate Login Register  
Pages: [1] 2 3 4 5 6 7 8 9 10 11  All
  Print  
Author Topic: NVIDIA Kepler (K20) from 134MHash/s to 330MHash/s with CUDA  (Read 44156 times)
psychocoder
Jr. Member
*
Offline Offline

Activity: 50


View Profile

Ignore
April 01, 2013, 10:08:44 AM
 #1

Hi,

Summary of informations from this thread:

Reposetory with my changes for rpcminer-mod (only Cuda): https://github.com/psychocoderHPC/rpcminer-mod
Reposetory with all changes and windows support (CUDA und OpenCL) (administrated by charliemaggot): https://github.com/cdmackie/rpcminer-mod

Known CUDA Errors unter Windows:
- cudart32_50.dll or cudart32_50.dll is missing    -> install https://developer.nvidia.com/cuda-downloads to solve this problem
- bitcoinminercuda.cpp:174 crash -> this means that the kernel run longer than windows allow, to solv this add the paramter -gpugrid 256 or other number to the parameters

Original Post:

I have changed the code of rpcminer-mod https://github.com/Ang3lus/rpcminer-mod a little bit thus we get better performance on Kepler GPUs.
First, since cuda 5.0 we have a rotated function inside of the ptx (parallel asm), we must not add this by hand because the compiler find it automaticly.

Example: (((x ) >> bits) | (x << (32 - bits))) is compiled to ptx command shf.l.wrap.b32

Note: The changes are not comatible with the opencl version, I only change the cuda source. In CMAKEList.txt is hard coded that sm_35 (Kepler code) is created.

Run new code with this parameter: -gpu=0  -aggression=8 -gpugrid=2048 -gputhreads=256

To install the patch in your code goto root of the project and run $  patch -p1 < patch.txt

Now you get over 300MHash/s out of a Kepler GPU, I think that GTX Kepler GPUs are faster than K20 HPC version. It can be that you must use sm_30 for GTX Kepler GPUs.

[EDIT:] for 330MHash/s the GPU needs 138 Watt power.

psychocoder

patch.txt
Code:
diff -Naur ./cmake-rpcminer/CMakeLists.txt ../../rpcminer-cuda_svn//cmake-rpcminer/CMakeLists.txt
--- ./cmake-rpcminer/CMakeLists.txt 2013-01-28 19:27:46.000000000 +0100
+++ ../../rpcminer-cuda_svn//cmake-rpcminer/CMakeLists.txt 2013-04-01 11:22:19.000000000 +0200
@@ -32,6 +32,10 @@
 
 IF(BITCOIN_ENABLE_CUDA)
  ADD_DEFINITIONS(-D_BITCOIN_MINER_CUDA_)
+ #SET(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -gencode=arch=compute_20,code=sm_20 -gencode=arch=compute_35,code=sm_35)
+ SET(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -gencode=arch=compute_35,code=sm_35  -Xptxas=-v -Xopencc=-LIST:source=on)
+ #SET(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -gencode=arch=compute_13,code=sm_13 -Xptxas=-v -Xopencc=-LIST:source=on)
+
  CUDA_ADD_EXECUTABLE(rpcminer ${BITCOIN_RPC_MINER_SRC} ${BITCOIN_RPC_MINER_CUDA_SRC})
  # Install generated PTX CUDA module
  INSTALL(FILES "${CMAKE_CURRENT_BINARY_DIR}/${generated_file_basename}.ptx" DESTINATION "${CMAKE_CURRENT_BINARY_DIR}/rpcminer-cuda" RENAME "bitcoinminercuda.ptx")
diff -Naur ./patch.txt ../../rpcminer-cuda_svn//patch.txt
--- ./patch.txt 2013-04-01 11:30:47.000000000 +0200
+++ ../../rpcminer-cuda_svn//patch.txt 1970-01-01 01:00:00.000000000 +0100
@@ -1,14 +0,0 @@
-diff -Naur ./cmake-rpcminer/CMakeLists.txt ../../rpcminer-cuda_svn//cmake-rpcminer/CMakeLists.txt
---- ./cmake-rpcminer/CMakeLists.txt 2013-01-28 19:27:46.000000000 +0100
-+++ ../../rpcminer-cuda_svn//cmake-rpcminer/CMakeLists.txt 2013-04-01 11:22:19.000000000 +0200
-@@ -32,6 +32,10 @@
-
- IF(BITCOIN_ENABLE_CUDA)
- ADD_DEFINITIONS(-D_BITCOIN_MINER_CUDA_)
-+ #SET(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -gencode=arch=compute_20,code=sm_20 -gencode=arch=compute_35,code=sm_35)
-+ SET(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -gencode=arch=compute_35,code=sm_35  -Xptxas=-v -Xopencc=-LIST:source=on)
-+ #SET(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -gencode=arch=compute_13,code=sm_13 -Xptxas=-v -Xopencc=-LIST:source=on)
-+
- CUDA_ADD_EXECUTABLE(rpcminer ${BITCOIN_RPC_MINER_SRC} ${BITCOIN_RPC_MINER_CUDA_SRC})
- # Install generated PTX CUDA module
- INSTALL(FILES "${CMAKE_CURRENT_BINARY_DIR}/${generated_file_basename}.ptx" DESTINATION "${CMAKE_CURRENT_BINARY_DIR}/rpcminer-cuda" RENAME "bitcoinminercuda.ptx")
diff -Naur ./src/cuda/bitcoinminercuda.cpp ../../rpcminer-cuda_svn//src/cuda/bitcoinminercuda.cpp
--- ./src/cuda/bitcoinminercuda.cpp 2013-01-28 19:27:46.000000000 +0100
+++ ../../rpcminer-cuda_svn//src/cuda/bitcoinminercuda.cpp 2013-04-01 11:22:30.000000000 +0200
@@ -290,6 +290,8 @@
  {
  AllocateResources(m_numb,m_numt);
  }
+ m_out[0].m_bestnonce=0;
+ cuMemcpyHtoD(m_devout,m_out,/*m_numb*m_numt*/sizeof(cuda_out));
 
  cuMemcpyHtoD(m_devin,m_in,sizeof(cuda_in));
 
@@ -319,11 +321,11 @@
  cuFuncSetBlockShape(m_function,m_numt,1,1);
  cuLaunchGrid(m_function,m_numb,1);
 
- cuMemcpyDtoH(m_out,m_devout,m_numb*m_numt*sizeof(cuda_out));
+ cuMemcpyDtoH(m_out,m_devout,/*m_numb*m_numt*/sizeof(cuda_out));
 
  // very unlikely that we will find more than 1 hash with H=0
  // so we'll just return the first one and not even worry about G
- for(int i=0; i<m_numb*m_numt; i++)
+ for(int i=0; i<1/*m_numb*m_numt*/; i++)
  {
  if(m_out[i].m_bestnonce!=0)// && m_out[i].m_bestg<bestg)
  {
diff -Naur ./src/cuda/bitcoinminercuda.cu ../../rpcminer-cuda_svn//src/cuda/bitcoinminercuda.cu
--- ./src/cuda/bitcoinminercuda.cu 2013-01-28 19:27:46.000000000 +0100
+++ ../../rpcminer-cuda_svn//src/cuda/bitcoinminercuda.cu 2013-04-01 11:22:30.000000000 +0200
@@ -18,20 +18,31 @@
 
 #include "cudashared.h"
 
-#define byteswap(x) (((x>>24) & 0x000000ff) | ((x>>8) & 0x0000ff00) | ((x<<8) & 0x00ff0000) | ((x<<24) & 0xff000000))
-#define rotateright(x,bits) (((x & 0xffffffff) >> bits) | (x << (32 - bits)))
-#define R(x) (work[x] = (rotateright(work[x-2],17)^rotateright(work[x-2],19)^((work[x-2]&0xffffffff)>>10)) + work[x -  7] + (rotateright(work[x-15],7)^rotateright(work[x-15],18)^((work[x-15]&0xffffffff)>>3)) + work[x - 16])
+#define rotateright(x,bits) (((x ) >> bits) | (x << (32 - bits)))
+
+#define R(x) (work[x] = (rotateright(work[x-2],17)^rotateright(work[x-2],19)^((work[x-2])>>10)) + work[x -  7] + (rotateright(work[x-15],7)^rotateright(work[x-15],18)^((work[x-15])>>3)) + work[x - 16])
+
 #define sharound(a,b,c,d,e,f,g,h,x,K) {t1=h+(rotateright(e,6)^rotateright(e,11)^rotateright(e,25))+(g^(e&(f^g)))+K+x; t2=(rotateright(a,2)^rotateright(a,13)^rotateright(a,22))+((a&b)|(c&(a|b))); d+=t1; h=t1+t2;}
 
-extern "C" __global__ void cuda_process(cuda_in *in, cuda_out *out, const unsigned int loops, const unsigned int bits)
+extern "C" __global__ void cuda_process(cuda_in __restrict__ *in, cuda_out __restrict__ *out, const unsigned int loops, const unsigned int bits)
 {
+  /*variable to check if any other block has a solution*/
+  __shared__ unsigned int canExit;
+  if(threadIdx.x==0)
+ canExit=out[0].m_bestnonce;
+  __syncthreads();
+  /*exit as fast as posible if one block has finished with solution*/
+  if(canExit!=0) return;
+
 
   unsigned int work[64];
   unsigned int A,B,C,D,E,F,G,H;
  const unsigned int myid=(blockIdx.x*blockDim.x+threadIdx.x);
  const unsigned int nonce=in->m_nonce + (myid << bits);
  unsigned int t1,t2;
- unsigned int bestnonce=0;
+ //unsigned int bestnonce=0;
+
+
  //unsigned int bestg=~0;
 
  unsigned int* in_m_AH = in->m_AH;
@@ -40,46 +51,42 @@
   unsigned int in_m_nbits = in->m_nbits;
 
  // the first 3 rounds we can do outside the loop because they depend on work[0] through work[2] which won't change
- unsigned int A1,B1,C1,D1,E1,F1,G1,H1;
- A1=in_m_AH[0];
- B1=in_m_AH[1];
- C1=in_m_AH[2];
- D1=in_m_AH[3];
- E1=in_m_AH[4];
- F1=in_m_AH[5];
- G1=in_m_AH[6];
- H1=in_m_AH[7];
- sharound(A1,B1,C1,D1,E1,F1,G1,H1,in_m_merkle,0x428A2F98);
- sharound(H1,A1,B1,C1,D1,E1,F1,G1,in_m_ntime,0x71374491);
- sharound(G1,H1,A1,B1,C1,D1,E1,F1,in_m_nbits,0xB5C0FBCF);
-    
-    #pragma unroll 1
- for(unsigned int it=0; it<loops; it++)
+ /* move old A1, ... H1 to shared to solve registers
+ * can also calculated on host and give to kernel, because its se same for all threads and blocks
+ */
+ __shared__ unsigned int AH[8];
+ __shared__ unsigned int AH2[8]; //cache for second round
+ if(threadIdx.x<8)
+ {
+  AH2[threadIdx.x]=AH[threadIdx.x]=in_m_AH[threadIdx.x];
+ }
+ __syncthreads();
+ if(threadIdx.x==0)
+ {
+ sharound(AH[0],AH[1],AH[2],AH[3],AH[4],AH[5],AH[6],AH[7],in_m_merkle,0x428A2F98);
+ sharound(AH[7],AH[0],AH[1],AH[2],AH[3],AH[4],AH[5],AH[6],in_m_ntime,0x71374491);
+ sharound(AH[6],AH[7],AH[0],AH[1],AH[2],AH[3],AH[4],AH[5],in_m_nbits,0xB5C0FBCF);
+ }
+ __syncthreads();
+
+        #pragma unroll 1
+ for(unsigned int it=0; it<loops; ++it)
  {
- /*
- A=in_m_AH[0];
- B=in_m_AH[1];
- C=in_m_AH[2];
- D=in_m_AH[3];
- E=in_m_AH[4];
- F=in_m_AH[5];
- G=in_m_AH[6];
- H=in_m_AH[7];
- */
- A=A1;
- B=B1;
- C=C1;
- D=D1;
- E=E1;
- F=F1;
- G=G1;
- H=H1;
+ if(out[0].m_bestnonce!=0) return;
 
+ A=AH[0];
+ B=AH[1];
+ C=AH[2];
+ D=AH[3];
+ E=AH[4];
+ F=AH[5];
+ G=AH[6];
+ H=AH[7];
  work[0]=in_m_merkle;
  work[1]=in_m_ntime;
  work[2]=in_m_nbits;
  //work[3]=byteswap(nonce+it);
- work[3]=nonce + it;
+ work[3]=nonce +it;
  work[4]=0x80000000;
  work[5]=0x00000000;
  work[6]=0x00000000;
@@ -160,14 +167,14 @@
 
  // hash the hash now
 
- work[0]=in_m_AH[0]+A;
- work[1]=in_m_AH[1]+B;
- work[2]=in_m_AH[2]+C;
- work[3]=in_m_AH[3]+D;
- work[4]=in_m_AH[4]+E;
- work[5]=in_m_AH[5]+F;
- work[6]=in_m_AH[6]+G;
- work[7]=in_m_AH[7]+H;
+ work[0]=AH2[0]+A;
+ work[1]=AH2[1]+B;
+ work[2]=AH2[2]+C;
+ work[3]=AH2[3]+D;
+ work[4]=AH2[4]+E;
+ work[5]=AH2[5]+F;
+ work[6]=AH2[6]+G;
+ work[7]=AH2[7]+H;
  work[8]=0x80000000;
  work[9]=0x00000000;
  work[10]=0x00000000;
@@ -258,13 +265,15 @@
 
  if((H==0))// && (G<=bestg))
  {
- bestnonce=nonce+it;
+ //bestnonce=nonce+it;
+ atomicExch(&(out[0].m_bestnonce),nonce+it); /*we only need one solution*/
+
  //bestg=G;
  }
 
    }
    
-    out[myid].m_bestnonce=bestnonce;
+    //out[myid].m_bestnonce=bestnonce;
     //out[myid].m_bestg=bestg;
 
 }


donations BTC: 1BEEBBTNwZAk9bEkLU56nR5GLWNMj98K9F
1397637864
Hero Member
*
Offline Offline

Posts: 1397637864

View Profile Personal Message (Offline)

Ignore
1397637864
Reply with quote  #2

1397637864
Report to moderator
1397637864
Hero Member
*
Offline Offline

Posts: 1397637864

View Profile Personal Message (Offline)

Ignore
1397637864
Reply with quote  #2

1397637864
Report to moderator
1397637864
Hero Member
*
Offline Offline

Posts: 1397637864

View Profile Personal Message (Offline)

Ignore
1397637864
Reply with quote  #2

1397637864
Report to moderator
CoinReporting   A Portfolio Manager For All Your Digital Currencies
» Join Now For FREE «

Advertised sites are not endorsed by the Bitcoin Forum. They may be unsafe, untrustworthy, or illegal in your jurisdiction. Advertise here.
1397637864
Hero Member
*
Offline Offline

Posts: 1397637864

View Profile Personal Message (Offline)

Ignore
1397637864
Reply with quote  #2

1397637864
Report to moderator
wumpus
Hero Member
*****
Offline Offline

Activity: 630

No Maps for These Territories


View Profile

Ignore
April 01, 2013, 10:33:15 AM
 #2

Nice find, so NVidia finally has a rotate instruction

Bitcoin Core developer [PGP]  Tips: 1L125pF2e7himW43Qu752ZFLtBLicxQmng Warning: For most, coin loss is a larger risk than coin theft. A disk can die any time. Regularly back up your wallet through FileBackup Wallet to an external storage or the (encrypted!) cloud. Use a separate offline wallet for storing larger amounts.
philips
Hero Member
*****
Offline Offline

Activity: 616



View Profile

Ignore
April 01, 2013, 10:58:37 AM
 #3

Watching  Shocked
niooron
Full Member
***
Offline Offline

Activity: 190


View Profile

Ignore
April 01, 2013, 11:20:28 PM
 #4

Now difficulty will explode even more.

14dxwuQwkQiLbZjJFfciZ26xSGdRU5mKEp
goxed
Hero Member
*****
Offline Offline

Activity: 854


Always try before you buy.


View Profile

Ignore
April 01, 2013, 11:22:23 PM
 #5

good news for nvidia camp

QikFury boards, from chips to mining in 2hrs. https://bitcointalk.org/index.php?topic=391689.0
Never underestimate the power of a pencil Wink https://bitcointalk.org/index.php?topic=287590.msg3099428#msg3099428  http://i.imgur.com/FY1CBQyl.jpg Your donations allows me to experiment with HW 1FrW5ka9CRVWdL9EM4Jut9eHWU1a1biyy5
coastermonger
Sr. Member
****
Offline Offline

Activity: 345


View Profile

Ignore
April 02, 2013, 12:15:12 AM
 #6

Wow, can you imagine if this were distributed a year ago? 

Now I just have to figure out if I can run this inside 50miner, or if I have to use cgminer...
relm9
Sr. Member
****
Online Online

Activity: 406



View Profile

Ignore
April 02, 2013, 12:18:30 AM
 #7

I've got a GTX Titan I could test this on - though, is there a version of this that compiles easily on Windows? The one the OP linked is Linux only.
philips
Hero Member
*****
Offline Offline

Activity: 616



View Profile

Ignore
April 02, 2013, 12:36:19 AM
 #8

Wow, can you imagine if this were distributed a year ago? 

Now I just have to figure out if I can run this inside 50miner, or if I have to use cgminer...

Maybe is not too late for Nvidia cards though...there is also this guy:
https://bitcointalk.org/index.php?topic=160057.0
psychocoder
Jr. Member
*
Offline Offline

Activity: 50


View Profile

Ignore
April 02, 2013, 05:41:03 AM
 #9

@relm9: No I have winows version, I only programm linux. I exit my bitcoin winter sleep to performe the NVIDIA GPU bitcoin mining process. I have now windows PC with K20 or Titan and therefore I can't test this with a winows miner.

@philips: Thanks for the last link, I look in if I can get some more performance.


donations BTC: 1BEEBBTNwZAk9bEkLU56nR5GLWNMj98K9F
ssateneth
Hero Member
*****
Offline Offline

Activity: 1022



View Profile

Ignore
April 02, 2013, 10:51:00 AM
 #10

Please compile and build for windows. No clue what to do with source. I can only click stuff.

If I helped you, please consider donating some BTC my way! 1FVLZTwSiAsf9z9dLZcfEuBY59HQprFJwQ
I am a long time trusted user: Bitcointalk forum trust ratings, Bitcoin-OTC Ratings, eBay Feedback, and Localbitcoins public profile.
wndrbr3d
Sr. Member
****
Offline Offline

Activity: 295


View Profile

Ignore
April 02, 2013, 12:24:14 PM
 #11

Well I feel vindicated  Grin

https://bitcointalk.org/index.php?topic=142270.0
psychocoder
Jr. Member
*
Offline Offline

Activity: 50


View Profile

Ignore
April 02, 2013, 02:22:44 PM
 #12

Not all performance came from the shift function. Most performance came from reducing registers per thread.

Before I start one threads needed 114 32Bit register (134 MHash/s)
After change the code thus we use shift operation we needed 95 32Bit register (~200MHash/s)
And after add shared memory we only need 46 registers. That means we can run 5 Block with 256 threads per streaming multiprocessor and we get 330 MHash/s.

At the moment I work on a version with over 400 MHash/s but I have some problems that the mining pool not count all my solutions.

If I have time I look if I can create a windows version.
 

donations BTC: 1BEEBBTNwZAk9bEkLU56nR5GLWNMj98K9F
Marrs
Member
**
Offline Offline

Activity: 112


View Profile

Ignore
April 02, 2013, 03:38:03 PM
 #13

Not all performance came from the shift function. Most performance came from reducing registers per thread.

Based on that, do you believe your changes should provide benefit to all nVidia hardware, not just Kepler-based boards?
crazyates
Hero Member
*****
Offline Offline

Activity: 700



View Profile

Ignore
April 02, 2013, 04:28:53 PM
 #14

Soo... still about the same as a $75 5830. GJ!

Tips? 1crazy8pMqgwJ7tX7ZPZmyPwFbc6xZKM9
PowerColor R9 290s for sale! | Previous Trade History
mitty
Sr. Member
****
Offline Offline

Activity: 361



View Profile

Ignore
April 02, 2013, 04:36:00 PM
 #15

Soo... still about the same as a $75 5830. GJ!
True, but this is pretty big for people who already own Nvidia cards and can now mine more efficiently with them.
wndrbr3d
Sr. Member
****
Offline Offline

Activity: 295


View Profile

Ignore
April 02, 2013, 06:22:14 PM
 #16

Based on that, do you believe your changes should provide benefit to all nVidia hardware, not just Kepler-based boards?

The shift functions are only available on Kepler based GPU's, -but- the other optimizations he has worked in there could give non-Kepler based cards the ~200Mhash performance.
kpriess
Sr. Member
****
Offline Offline

Activity: 410


Der will zur macht..


View Profile

Ignore
April 02, 2013, 06:59:17 PM
 #17


I would appreciate if you could compile the Windows' binaries..
Most nvidia users are on Windows due to well, directx..
Evan
Sr. Member
****
Offline Offline

Activity: 294



View Profile

Ignore
April 02, 2013, 11:58:03 PM
 #18

I will give  .25BTC for a Windows version to be created and maintained.... I have 37 nvidia cards that are BEGGING TO BE WORKED

I am poor, but i do work for Coin Smiley
1PtHcavXoakgNkQfEQdvnvEksEY2NvwaLM
Wolf0
Sr. Member
****
Offline Offline

Activity: 448


View Profile

Ignore
April 03, 2013, 03:17:39 AM
 #19

I will give  .25BTC for a Windows version to be created and maintained.... I have 37 nvidia cards that are BEGGING TO BE WORKED

I'll compile it for Windows, if it compiles for Windows.

EDIT: Ugh, cmake. I have no idea how to cross-compile this.

BTC.sx - For an easy way to trade Bitcoin, using Bitcoin.
Buy cheap GPUs with GPU Coin
Win a video card! - Drawing/Contest
Jay_Pal
Hero Member
*****
Offline Offline

Activity: 818



View Profile

Ignore
April 03, 2013, 08:28:13 AM
 #20

Based on that, do you believe your changes should provide benefit to all nVidia hardware, not just Kepler-based boards?

The shift functions are only available on Kepler based GPU's, -but- the other optimizations he has worked in there could give non-Kepler based cards the ~200Mhash performance.

I'd love to test this on my rig but I don't know if it would be ever possible since it has an onboard GeForce 8200 that used to give ~20Mh/s, but never been able to use or even see it after plugging an ATI Radeon 5550.

Would it be any way I could "see" this device listed on lspci and have it working to mine with this patch?

  BITMIXER.IO   High Volume Bitcoin MIXER  
Don't Panic... Earn BitCoins doing NOTHING ## Up to 0.225BTC - 1G8zjUzeZBfJpeCbz1MLTc6zQHbLm78vKc
Pages: [1] 2 3 4 5 6 7 8 9 10 11  All
  Print  
 
Jump to:  

Sponsored by , a Bitcoin-accepting VPN.
Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!