Bitcoin Forum
April 20, 2014, 08:15:03 AM *
Author Topic: Ufasoft Miner - Windows/Linux, x86/x64, SSE2/OpenCL, Open Source  (Read 441265 times)
d3m0n1q_733rz
Sr. Member
February 01, 2012, 07:05:39 AM
 #681

Actually I have a complaint. Your miner will work for about 20 minutes correctly, after which every share is rejected. I am using slush's pool which is accepting most of my GPU work, but your miner accepts no more than 3 shares before every share is rejected.
You must be using the 64-bit version then.  It seems to have a problem with the block updates.  Once the block changes, it's not keeping up.

Funroll_Loops, the theoretically quicker breakfast cereal!
Check out http://www.facebook.com/JupiterICT for all of your computing needs.  If you need it, we can get it.  We have solutions for your computing conundrums.  BTC accepted!  12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq
djinfected
Newbie
February 01, 2012, 11:49:48 AM
 #682

Actually I have a complaint. Your miner will work for about 20 minutes correctly, after which every share is rejected. I am using slush's pool which is accepting most of my GPU work, but your miner accepts no more than 3 shares before every share is rejected.
You must be using the 64-bit version then.  It seems to have a problem with the block updates.  Once the block changes, it's not keeping up.
Oh, yeah I am. I'll try the 32-bit version for a while. Thanks.
ufasoft
Sr. Member
February 04, 2012, 10:49:38 AM
 #683

You must be using the 64-bit version then.  It seems to have a problem with the block updates.  Once the block changes, it's not keeping up.

Oh, yeah I am. I'll try the 32-bit version for a while. Thanks.

This x64 CPU-mining bug is fixed in version 0.27.

Bitcoin donations: 16kfodhAckE8FZQpNcDwzG3tDGxypGTdwm
d3m0n1q_733rz
Sr. Member
February 04, 2012, 01:17:25 PM
 #684

Well, it seems to work.  However, the 64-bit version appears slower than the 32-bit one on Windows 7.
Also, I was going through some of the assembly source code. If you added some if statements based on processor capabilities, you could add quite a few optimizations: streaming moves using movntdqa to avoid some of the lower caches (if available), YMM registers, which just combine two XMM registers so the code stays compatible (if available), AVX extensions to combine a few functions (if available), etc.  Heck, CPUs these days are capable of 256-bit computing, so you could effectively double the output.
Granted, this is based on the Linux source code.  Standard SSE2 is compatible, but a little spice could be nice.

ufasoft
Sr. Member
February 04, 2012, 01:34:18 PM
 #685

upon processor capabilities, you could add quite a few optimizations.  For one, streaming moves by using movntdqa to avoid some of the lower caches (if available)
Without caching it will obviously be slower.

, using YMM registers which are just combining two XMM registers making the code compatible (if available), use of AVX extensions to combine a few functions (if available), etc.  Heck, anymore, CPUs are capable of 256-bit computing. 
AVX doesn't support integer operations; it has a floating-point-only ALU.

d3m0n1q_733rz
Sr. Member
February 04, 2012, 02:26:44 PM
 #686

upon processor capabilities, you could add quite a few optimizations.  For one, streaming moves by using movntdqa to avoid some of the lower caches (if available)
Without caching it will be slower obviously.

, using YMM registers which are just combining two XMM registers making the code compatible (if available), use of AVX extensions to combine a few functions (if available), etc.  Heck, anymore, CPUs are capable of 256-bit computing. 
AVX don't support Integer operations, it has Float-point only ALU.
MOVNTDQA does cache, but it caches directly to L1 for immediate use while skipping over the other caches.  And AVX does support integers; it just allows them to be vectorized into 256 bits so you can perform the same calculation on 4 integers at once.  http://software.intel.com/en-us/articles/intel-avx-new-frontiers-in-performance-improvements-and-energy-efficiency/
It does, however, say that floating-point code will benefit the most, but the code can be vectorized for AVX-capable processors.

ufasoft
Sr. Member
February 04, 2012, 02:54:51 PM
 #687

The MOVNTDQA does cache, but it caches directly to L1 for immediate use while skipping over the other caches.  And
I don't see any performance improvement here. In both cases I have L1-cached data.

But you can try: just patch the .ASM file and build it on Linux.

AVX does support integers; it just allows them to be vectorized into 256-bits so you can perform the same calculation
This doc says:
Extensibility: Intel AVX has powerful built-in extensibility options for the future without resorting to code growth:
OS context management rework only needs to be done once.
Future Vector Integer support to 256 and 512 bits


At some point in the future, AVX will support vector integers.

d3m0n1q_733rz
Sr. Member
February 04, 2012, 04:47:06 PM
 #688

Okay, let me try one more time.  You can transfer data to the XMM registers for now.  Okay, I got ahead of myself with AVX2.  But there are integer instructions that will speed up the code, like VPADDD, which adds two registers and stores the result in a third.  AVX2 mainly just extends those instructions to the 256-bit registers instead of just the 128-bit ones.  But AVX does speed things up for integers too.  Check out the reference card and search for the word integers.  There's also vectorized shuffling, etc.  So it's not just for floating-point data, though it is mainly geared toward it.  It's at the 128-bit level only as of right now.  I can't wait for the AVX2 instruction set to come out.
One instruction in particular that I've already seen will help is VPAND.  And VPADDD will probably come in handy too.
Here's the data on the streaming loads via MOVNTDQA.  http://software.intel.com/en-us/articles/increasing-memory-throughput-with-intel-streaming-simd-extensions-4-intel-sse4-streaming-load/
But yeah, there are vectorized versions of almost all instructions up to 128-bit now, and they're not just floating-point.  Unfortunately, they're probably easier to implement through C than asm.  But, again, their use would have to be determined based on the processor; probably by intrinsics not available through asm.  I'm just saying, as long as everything's aligned, these vectorizations can probably remove a few extra instructions as long as the processors can handle them.
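To make the packed-integer point concrete, here is a minimal sketch in plain Python (not the miner's assembly) that models a 128-bit register as four 32-bit lanes. The helper names pand, pxor, pandn, paddd, rotr and ch are illustrative and only mirror what the corresponding packed instructions do per lane. SSE2 has no packed rotate, so the rotate is modelled as two shifts and an OR.

```python
MASK32 = 0xFFFFFFFF  # each lane is an unsigned 32-bit integer

def pand(a, b):
    """Lane-wise AND, modelling PAND/VPAND on four packed 32-bit integers."""
    return [x & y for x, y in zip(a, b)]

def pxor(a, b):
    """Lane-wise XOR, modelling PXOR/VPXOR."""
    return [x ^ y for x, y in zip(a, b)]

def pandn(a, b):
    """Lane-wise (NOT a) AND b, modelling PANDN/VPANDN."""
    return [(~x & MASK32) & y for x, y in zip(a, b)]

def paddd(a, b):
    """Lane-wise 32-bit add with wraparound, modelling PADDD/VPADDD."""
    return [(x + y) & MASK32 for x, y in zip(a, b)]

def rotr(a, n):
    """Lane-wise rotate right built from two shifts and an OR."""
    return [((x >> n) | (x << (32 - n))) & MASK32 for x in a]

def ch(e, f, g):
    """SHA-256 Ch(e, f, g) = (e AND f) XOR (NOT e AND g), four hashes at once."""
    return pxor(pand(e, f), pandn(e, g))
```

Each list element stands for one independent hash, which is how a 4-way SSE2 miner would get four SHA-256 computations out of one instruction stream.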

ufasoft
Sr. Member
February 04, 2012, 04:59:38 PM
 #689


I agree that MOVNTDQA is useful for streaming, but SHA256 calculation is not stream processing. It does not need to save anything to RAM.

But AVX does speed up the process for integers too.  Check out the reference card and search for the word integers.  There's also vectorized shuffling, etc.  So it's not just for floating point data, but it is mainly geared toward it.  But yeah, it's at 128-bit level only as of right now.  I can't wait for the AVX2 instruction set to come out.

SHA256 implementations require integer ADD, shift, XOR, and AND.
Only XOR and AND can be done in AVX.
For the other instructions it is necessary to:
1. shuffle the high half of the YMM register to the low half,
2. run the integer instruction,
3. shuffle the halves back.
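The three steps above can be sketched in plain Python. This is only a model, assuming a YMM register is represented as a list of eight 32-bit lanes; paddd_128 and add_256_via_halves are hypothetical names, not functions from the miner.

```python
MASK32 = 0xFFFFFFFF  # each lane is an unsigned 32-bit integer

def paddd_128(a, b):
    """Model of a 128-bit packed add: four 32-bit lanes with wraparound."""
    assert len(a) == len(b) == 4
    return [(x + y) & MASK32 for x, y in zip(a, b)]

def add_256_via_halves(ymm_a, ymm_b):
    """Add eight 32-bit lanes when only a 128-bit integer add is available."""
    assert len(ymm_a) == len(ymm_b) == 8
    # The low halves can be added directly with the 128-bit instruction.
    lo = paddd_128(ymm_a[:4], ymm_b[:4])
    # Steps 1 and 2: bring the high halves down and run the integer add on them.
    hi = paddd_128(ymm_a[4:], ymm_b[4:])
    # Step 3: put the high half back above the low half.
    return lo + hi
```

The model needs two 128-bit adds plus the moves between halves for every 256-bit add, so it does no less work than two plain XMM adds.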

d3m0n1q_733rz
Sr. Member
February 04, 2012, 05:47:37 PM
 #690


I agree that MOVNTDQA useful for streaming. But SHA256 calculation is not stream processing. It need not to save anything to RAM.

But AVX does speed up the process for integers too.  Check out the reference card and search for the word integers.  There's also vectorized shuffling, etc.  So it's not just for floating point data, but it is mainly geared toward it.  But yeah, it's at 128-bit level only as of right now.  I can't wait for the AVX2 instruction set to come out.

SHA256 implementations requires integer ADD, Shift, XOR, AND.
Only XOR and AND can be done in AVX.
For other instructions it is necessary to:
1. shuffle high half of YMM to low half,
2. run Integer instruction
3. shuffle the halfs back.

That's only if you use the full 256 bits, which should be avoided until AVX2.  But you're only using XMM, not YMM, so none of that comes into play.  And sorry, I'm about to hit the wall; I've been up all night.  Basically, if you want to use the full YMM registers, you can't do much for integers yet.  It's coming.  However, you can use the lower half of a YMM register by specifying the XMM equivalent.  The XMM registers can do your ADDs, shifts, XORs and ANDs via AVX.  So, when AVX2 comes around, it'll be much simpler to modify the code to use the YMM registers to achieve the same things in 256 bits.
As for MOVNTDQA, it combines smaller loads into a single stream via a buffer.  It's mainly useful for loading several 16-byte chunks addressed through eax into xmm registers by converting them into one 64-byte read, which increases the bandwidth approx. 10x.
The CPU cache is considered memory.  But yeah, I try not to memorize every detail of everything because I get confused easily.  The idea is that it combines smaller loads into one large one by streaming them without polluting the cache.
Here's an example taken from the Intel explanation and modified to better show how it pertains to the code.


Code:
; This load retrieves a full cache line, which is stored in a temporary streaming load
; buffer.
; eax is a pointer to system-allocated memory of type USWC.
              MOVNTDQA xmm0, [eax+0]
; Subsequent 16-byte loads from the same cache line are supplied from the streaming
; load buffer and occur much faster (the read is converted from 16 bytes to 64 bytes).
              MOVNTDQA xmm1, [eax+16]
              MOVNTDQA xmm2, [eax+32]
              MOVNTDQA xmm3, [eax+48]
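The buffering behaviour described in those comments can be modelled in a few lines of Python. This is a rough sketch under stated assumptions: a single 64-byte streaming load buffer, 16-byte aligned loads, and USWC memory as in the Intel article. StreamingLoadModel is a hypothetical class for illustration only; as far as Intel's documentation describes it, ordinary write-back memory is served from the normal caches instead.

```python
LINE_BYTES = 64  # assumed cache-line size

class StreamingLoadModel:
    """Counts how many full-line fetches a sequence of 16-byte loads causes."""

    def __init__(self):
        self.buffered_line = None  # index of the line held in the buffer
        self.line_fetches = 0      # number of 64-byte fetches from memory

    def load16(self, addr):
        """Model one 16-byte streaming load from an aligned address."""
        assert addr % 16 == 0, "MOVNTDQA needs a 16-byte aligned address"
        line = addr // LINE_BYTES
        if line != self.buffered_line:
            # First touch of this line: fetch all 64 bytes into the buffer.
            self.buffered_line = line
            self.line_fetches += 1
        return line

model = StreamingLoadModel()
for offset in (0, 16, 32, 48):      # the four loads from the listing above
    model.load16(0x1000 + offset)
fetches_for_one_line = model.line_fetches
```

In this model the four loads cost one line fetch, and a fifth load at offset 64 would trigger a second one.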

ufasoft
Sr. Member
February 04, 2012, 06:11:02 PM
 #691

only if you use 256-bit which should be avoided until AVX2.  But you're only using XMM, not YMM so none of that comes into play.  And sorry, I'm about to hit the wall.  I've been up all night.  Basically, if you're wanting to use the full YMM registers, you can't do much for integers as of yet.  It's coming.  However, you can use the lower half of the YMM by specifying the XMM equivalent.  The XMM registers can use your ADDs, Shifts, XORs and ANDs via AVX.  So,
Hm, the miner already uses XMM registers (SSE2) in its current implementation.

As for the MOVNTDQA, it combines smaller writes into a single stream via a buffer.  It's mainly useful for transferring multiple eax into xmm by converting it into a 64-byte write which increases the bandwidth approx. 10x.
It is not applicable to SHA256 calculation.


d3m0n1q_733rz
Sr. Member
February 04, 2012, 06:14:02 PM
 #692

Code:
LAB_NEXT_NONCE:
   mov      zsi, init
   
   mov      zbx, w
   mov      zax, pnonce              <--
   mov      eax, [zax]                <--
   mov      [zbx+3*4], eax

   mov      zcx, 64
   mov      zax, 18             <--
   mov      zdi, 3



I'm seeing a few hiccups here.  And, just to make sure, is this code going (from, to) or (to, from)?  Different assemblers like it different ways.  If it's (to, from), then why not just:

Code:
LAB_NEXT_NONCE:
   mov      zsi, init
   
   mov      zbx, w
   mov      eax, pnonce
   mov      [zbx+3*4], eax

   mov      zcx, 64
   mov      zax, 18
   mov      zdi, 3

I'm finding a few of these in the 64-byte write range as well.  Granted, I didn't take into account how long each operation takes to complete, but that shouldn't be an issue here.  But yeah, Intel likes its sets of 4 instructions sometimes.
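On the operand-order question: the listing uses Intel-style syntax, so the destination comes first (to, from). The sketch below models both sequences in plain Python, assuming that pnonce is a variable holding the address of the nonce, as its name and the original two-instruction sequence suggest. The addresses and the nonce value are made up for illustration.

```python
# Toy memory: address -> value. All addresses and values are hypothetical.
PNONCE_VAR = 0x1000          # where the variable pnonce itself lives
memory = {
    PNONCE_VAR: 0x2000,      # pnonce holds the address of the nonce
    0x2000: 0xDEADBEEF,      # the nonce value
}

def original_sequence():
    """mov zax, pnonce / mov eax, [zax]: load the pointer, then dereference it."""
    zax = memory[PNONCE_VAR]     # zax = address of the nonce
    eax = memory[zax]            # eax = the nonce itself
    return eax

def proposed_sequence():
    """mov eax, pnonce: a single load reads the variable, i.e. the pointer."""
    eax = memory[PNONCE_VAR] & 0xFFFFFFFF
    return eax
```

Under that assumption the shortened version stores the pointer into w[3] instead of the nonce, because one x86 load cannot follow two levels of indirection.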

d3m0n1q_733rz
Sr. Member
February 04, 2012, 06:20:55 PM
 #693

only if you use 256-bit which should be avoided until AVX2.  But you're only using XMM, not YMM so none of that comes into play.  And sorry, I'm about to hit the wall.  I've been up all night.  Basically, if you're wanting to use the full YMM registers, you can't do much for integers as of yet.  It's coming.  However, you can use the lower half of the YMM by specifying the XMM equivalent.  The XMM registers can use your ADDs, Shifts, XORs and ANDs via AVX.  So,
Hm, the Miner uses XMM registers (SSE2) in current implementation already.

As for the MOVNTDQA, it combines smaller writes into a single stream via a buffer.  It's mainly useful for transferring multiple eax into xmm by converting it into a 64-byte write which increases the bandwidth approx. 10x.
It is not applicable for SHA256 calculating.


So, you're telling me that we're not moving 16-byte chunks of data into the xmm registers, where a 64-byte streaming load would apply?  Because I see a few places here where we are.  Like here:


Code:
ELSE
movdqa xmm3, [zsp+5*16]
movdqa xmm4, [zsp+6*16]
movdqa xmm5, [zsp+7*16]

paddd xmm3, [zsi+5*16]
paddd xmm4, [zsi+6*16]
paddd xmm5, [zsi+7*16]

movdqa [zbx+5*16], xmm3
movdqa [zbx+6*16], xmm4
movdqa [zbx+7*16], xmm5
Changed to:

Code:
ELSE
movntdqa xmm3, [zsp+5*16]  ; Fetches the 64-byte line containing this address into the streaming load buffer.  Reads an extra 16 bytes, but transfers at 7.5x.
movntdqa xmm4, [zsp+6*16]  ; Same line is already buffered, so it doesn't need to be read again.
movntdqa xmm5, [zsp+7*16]  ; Also served from the buffer, so this load is faster too.

paddd xmm3, [zsi+5*16]
paddd xmm4, [zsi+6*16]
paddd xmm5, [zsi+7*16]

movdqa [zbx+5*16], xmm3
movdqa [zbx+6*16], xmm4
movdqa [zbx+7*16], xmm5

Note that these changes only work on CPUs with SSE4.1, so an if-else statement is required.
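The if-else could look roughly like the following Python sketch. It assumes the feature list comes from the "flags" line of /proc/cpuinfo on Linux, where SSE4.1 is reported as sse4_1; a real miner would more likely query CPUID directly. parse_flags and pick_load_instruction are hypothetical helper names.

```python
def parse_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Line format: "flags<tabs>: fpu vme ... sse2 ... sse4_1 ..."
            return set(line.split(":", 1)[1].split())
    return set()

def pick_load_instruction(cpu_flags):
    """Choose the load used by the inner loop based on CPU capabilities."""
    if "sse4_1" in cpu_flags:
        return "movntdqa"   # streaming load, needs SSE4.1
    return "movdqa"         # plain aligned load, available with SSE2

sample = "processor\t: 0\nflags\t\t: fpu sse sse2 ssse3 sse4_1 sse4_2\n"
chosen = pick_load_instruction(parse_flags(sample))
```

Doing the check once at startup and selecting a code path keeps the branch out of the hashing loop.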

ufasoft
Sr. Member
February 04, 2012, 07:26:48 PM
 #694

So, you're telling me that we're not moving 16-byte chunks of data into the xmm registers, where a 64-byte streaming load would apply?  Because I see a few places here where we are.  Like here:

Code:
ELSE
movdqa xmm3, [zsp+5*16]
movdqa xmm4, [zsp+6*16]
movdqa xmm5, [zsp+7*16]

paddd xmm3, [zsi+5*16]
paddd xmm4, [zsi+6*16]
paddd xmm5, [zsi+7*16]

movdqa [zbx+5*16], xmm3
movdqa [zbx+6*16], xmm4
movdqa [zbx+7*16], xmm5

This data is already in the L1 cache, so it makes no difference which MOV instruction loads it.
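A back-of-the-envelope check of the L1 claim, as a Python sketch. The figures are assumptions rather than measurements of the miner: four hashes interleaved per XMM register (suggested by the 16-byte stride in the listing), a 64-word message schedule, 8 state words, 64 round constants replicated per lane, and a 32 KiB L1 data cache.

```python
LANES = 4                 # assumed: four 32-bit hashes per 128-bit register
WORD_BYTES = 4            # SHA-256 works on 32-bit words
L1_DATA_BYTES = 32 * 1024 # assumed L1 data cache size

schedule_bytes = 64 * LANES * WORD_BYTES   # message schedule w[0..63]
state_bytes = 8 * LANES * WORD_BYTES       # working variables a..h
constant_bytes = 64 * LANES * WORD_BYTES   # round constants, one copy per lane

working_set = schedule_bytes + state_bytes + constant_bytes
fits_in_l1 = working_set <= L1_DATA_BYTES

print(working_set, "bytes of", L1_DATA_BYTES, "-> fits:", fits_in_l1)
```

Under these assumptions the working set is about 2 KiB, a small fraction of the cache, which is consistent with the point that the loads are already served from L1.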

Anyway, if you are a practical programmer, you can profile the patched miner. Anything that is not tested is just speculation.

greatwolf
Full Member
February 05, 2012, 07:19:46 AM
 #695

Is there a way to integrate the opencl kernel used by phoenix miner into ufasoft-miner? The current kernel used in ufasoft-miner seems to create a lot of gpu display lag even when the -T temperature setting is low. In Phoenix miner there's an AGGRESSION setting that can be adjusted to reduce display lag but it doesn't seem like ufasoft-miner has an equivalent option.
ufasoft
Sr. Member
February 05, 2012, 08:14:13 AM
 #696

Is there a way to integrate the opencl kernel used by phoenix miner into ufasoft-miner? The current kernel used in ufasoft-miner seems to create a lot of gpu display lag even when the -T temperature setting is low. In Phoenix miner there's an AGGRESSION setting that can be adjusted to reduce display lag but it doesn't seem like ufasoft-miner has an equivalent option.
To implement AGGRESSION we would need to change the C++ wrapper code. The OpenCL kernels (file phatk.cl) are very similar across most miners.
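One way such a setting could work in the wrapper, sketched in Python rather than the miner's C++. The assumption here is that an aggression value n means 2**n nonces per kernel launch; this is an illustration of the idea, not how ufasoft-miner or Phoenix actually define it, and nonce_batches is a hypothetical helper.

```python
def nonce_batches(first_nonce, total, aggression):
    """Split a nonce range into kernel launches of at most 2**aggression nonces."""
    batch = 1 << aggression
    start = first_nonce
    end = first_nonce + total
    while start < end:
        count = min(batch, end - start)
        # Each (start, count) pair would become one kernel launch; between
        # launches the GPU is free to service the display.
        yield start, count
        start += count

launches = list(nonce_batches(0, 10, 2))
```

A lower value gives more, shorter launches and so less display lag, at the cost of more launch overhead per hash.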

kentrolla
Hero Member
February 07, 2012, 02:03:18 AM
 #697

yea, i love this miner. the only problem i have is that i can't mine with this miner while using my computer. It just gets way too laggy. i wish there was an aggression or intensity setting or something like that.

greatwolf
Full Member
February 13, 2012, 05:42:21 AM
 #698

yea, i love this miner. the only problem i have is that i can't mine with this miner while using my computer. It just gets way too laggy. i wish there was an aggression or intensity setting or something like that.

I completely echo kentrolla's sentiment about ufasoft-miner. What I really like about it is that it combines CPU and GPU mining together and I get a much higher hash rate on CPU mining with this than with other CPU miners.

To implement AGGRESSION we should change C++ wrapping code. OpenCL kernels (file phatk.cl), are very similar on most miners.

Are there any plans to implement AGGRESSION for say the next version of ufasoft-miner(v0.28+ maybe)? I'm also wondering if the latest repository source tree for ufasoft-miner can be found anywhere? (like using git or mercurial etc.). The main page only has the source for 0.25.

Thanks
d3m0n1q_733rz
Sr. Member
February 13, 2012, 10:05:49 AM
 #699

yea, i love this miner. the only problem i have is that i can't mine with this miner while using my computer. It just gets way too laggy. i wish there was an aggression or intensity setting or something like that.

I completely echo kentrolla's sentiment about ufasoft-miner. What I really like about it is that it combines CPU and GPU mining together and I get a much higher hash rate on CPU mining with this than with other CPU miners.

To implement AGGRESSION we should change C++ wrapping code. OpenCL kernels (file phatk.cl), are very similar on most miners.

Are there any plans to implement AGGRESSION for say the next version of ufasoft-miner(v0.28+ maybe)? I'm also wondering if the latest repository source tree for ufasoft-miner can be found anywhere? (like using git or mercurial etc.). The main page only has the source for 0.25.

Thanks
You know, if you want to change the aggression so badly, change the nice level in Linux or the priority in Windows.  It's basically the same thing.

K1773R
Hero Member
February 13, 2012, 03:24:30 PM
 #700

yea, i love this miner. the only problem i have is that i can't mine with this miner while using my computer. It just gets way too laggy. i wish there was an aggression or intensity setting or something like that.

I completely echo kentrolla's sentiment about ufasoft-miner. What I really like about it is that it combines CPU and GPU mining together and I get a much higher hash rate on CPU mining with this than with other CPU miners.

To implement AGGRESSION we should change C++ wrapping code. OpenCL kernels (file phatk.cl), are very similar on most miners.

Are there any plans to implement AGGRESSION for say the next version of ufasoft-miner(v0.28+ maybe)? I'm also wondering if the latest repository source tree for ufasoft-miner can be found anywhere? (like using git or mercurial etc.). The main page only has the source for 0.25.

Thanks
You know, if you want to change the aggression so badly, change the nice level in Linux or the priority in Windows.  It's basically the same thing.
The nice level doesn't affect GPU rendering (or other GPU work) at all...

[GPG Public Key]  [Devcoin Builds]  [Multichain Blockexplorer]  [Multichain Blockexplorer - PoS Coins]  [Ufasoft Miner Linux Builds]
BTC/DVC/TRC/FRC: 1K1773RbXRZVRQSSXe9N6N2MUFERvrdu6y ANC/XPM AK1773RTmRKtvbKBCrUu95UQg5iegrqyeA NMC: NK1773Rzv8b4ugmCgX789PbjewA9fL9Dy1 BQC: bK1773R1APJz4yTgRkmdKQhjhiMyQpJgfN