Bitcoin Forum
Topic: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10
nelisky
August 18, 2010, 11:02:25 PM
 #41

So is it accurate to say that, so far, only Intel Core i7 processors and certain (Phenom?) AMD processors enjoy a speed bump from -4way?


And i5, at least on my macbookpro
Ground Loop
August 18, 2010, 11:14:26 PM
Last edit: August 22, 2010, 03:00:00 AM by Ground Loop
 #42

Any non-Mac i5 love?
Windows i5 64-bit got slower here.
[correction -- not true.  Windows doesn't have -4way, and the Linux machines are Xeons.]

Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq
vess
August 19, 2010, 06:41:27 AM
 #43

My Core i5 laptop (Ubuntu) doubled in speed. Actually, it didn't double in speed. It stayed the same speed, but only uses half the CPU now. I can't get it to go back to full CPU usage. That said, my laptop is a lot cooler when generating blocks now. I'll post back if I see it successfully go up to 100% usage.

I'm the CEO of CoinLab (www.coinlab.com) and the Executive Director of the Bitcoin Foundation, I will identify if I'm speaking for myself or one of the organizations when I post from this account.
satoshi (OP)
August 19, 2010, 07:07:43 PM
 #44

Any non-Mac i5 love?
Windows i5 64-bit got slower here.
That's the first I've heard anyone say i5 was slower.  Everyone else has said 4way was faster on i5.  More so with hyperthreading enabled.

And i5, at least on my macbookpro
Good, so I take it that's a confirmation that it's working on Mac as well?

Laszlo told me he did compile in the -4way stuff on Mac, so the -4way switch is also available to try on Mac.  I don't think makefile.osx on SVN has it yet, just the built version.
nelisky
August 19, 2010, 08:28:19 PM
 #45

And i5, at least on my macbookpro
Good, so I take it that's a confirmation that it's working on Mac as well?

Laszlo told me he did compile in the -4way stuff on Mac, so the -4way switch is also available to try on Mac.  I don't think makefile.osx on SVN has it yet, just the built version.

Yep, it's working all right. The numbers I had posted were from an old svn revision patched with tcatm's changes, but today I compiled trunk. I had to tweak the makefile once again, but after that it works great, with the numbers matching what I experienced before.

The changes I made for my system are below. Some are cosmetic, like removing wx-config from the bitcoind build to avoid the warnings if you don't have it installed. Others are system specific, like the DEPS dir and the fact that I don't have 32-bit libs, which makes the link step fail if -arch i386 is there.
The bsddb changes correct what I believe is a typo: the includes and libs point to db46, but the object list for the linker states db48. Anyway, here's the diff for what got me going:

Code:
Index: makefile.osx
===================================================================
--- makefile.osx (revision 139)
+++ makefile.osx (working copy)
@@ -6,29 +6,29 @@
 # Laszlo Hanyecz (solar@heliacal.net)
 
 CXX=llvm-g++
-DEPSDIR=/Users/macosuser/bitcoin/deps
+DEPSDIR=/opt/local
 
 INCLUDEPATHS= \
- -I"$(DEPSDIR)/include"
+ -I"$(DEPSDIR)/include"  -I"$(DEPSDIR)/include/db46"
 
 LIBPATHS= \
- -L"$(DEPSDIR)/lib"
+ -L"$(DEPSDIR)/lib"  -L"$(DEPSDIR)/lib/db46"
 
-WXLIBS=$(shell $(DEPSDIR)/bin/wx-config --libs --static)
+WXLIBS=
 
 LIBS= -dead_strip \
- $(DEPSDIR)/lib/libdb_cxx-4.8.a \
- $(DEPSDIR)/lib/libboost_system.a \
- $(DEPSDIR)/lib/libboost_filesystem.a \
- $(DEPSDIR)/lib/libboost_program_options.a \
- $(DEPSDIR)/lib/libboost_thread.a \
+ $(DEPSDIR)/lib/db46/libdb_cxx-4.6.a \
+ $(DEPSDIR)/lib/libboost_system-mt.a \
+ $(DEPSDIR)/lib/libboost_filesystem-mt.a \
+ $(DEPSDIR)/lib/libboost_program_options-mt.a \
+ $(DEPSDIR)/lib/libboost_thread-mt.a \
  $(DEPSDIR)/lib/libcrypto.a
 
-DEFS=$(shell $(DEPSDIR)/bin/wx-config --cxxflags) -D__WXMAC_OSX__ -DNOPCH -DMSG_NOSIGNAL=0
+DEFS=-D__WXMAC_OSX__ -DNOPCH -DMSG_NOSIGNAL=0 -DFOURWAYSSE2
 
 DEBUGFLAGS=-g -DwxDEBUG_LEVEL=0
 # ppc doesn't work because we don't support big-endian
-CFLAGS=-mmacosx-version-min=10.5 -arch i386 -arch x86_64 -O3 -Wno-invalid-offsetof -Wformat $(DEBUGFLAGS) $(DEFS) $(INCLUDEPATHS)
+CFLAGS=-mmacosx-version-min=10.5 -arch x86_64 -O3 -Wno-invalid-offsetof -Wformat $(DEBUGFLAGS) $(DEFS) $(INCLUDEPATHS)
 HEADERS=headers.h strlcpy.h serialize.h uint256.h util.h key.h bignum.h base58.h \
     script.h db.h net.h irc.h main.h rpc.h uibase.h ui.h noui.h init.h
 
@@ -42,6 +42,7 @@
     obj/rpc.o \
     obj/init.o \
     cryptopp/obj/sha.o \
+    obj/sha256.o \
     cryptopp/obj/cpu.o
 
 
@@ -55,7 +56,7 @@
  $(CXX) -c $(CFLAGS) -O3 -DCRYPTOPP_DISABLE_ASM -o $@ $<
 
 bitcoin: $(OBJS) obj/ui.o obj/uibase.o
- $(CXX) $(CFLAGS) -o $@ $(LIBPATHS) $^ $(WXLIBS) $(LIBS)
+ $(CXX) $(shell $(DEPSDIR)/bin/wx-config --cxxflags) $(CFLAGS) -o $@ $(LIBPATHS) $^ $(shell $(DEPSDIR)/bin/wx-config --libs --static) $(LIBS)
 
 
 obj/nogui/%.o: %.cpp $(HEADERS)
ArtForz
August 21, 2010, 04:56:31 PM
 #46

The difference between new and older CPUs is pretty easy to explain.
Older microarchitectures have 64-bit mmx/sse execution units and split 128bit sse ops into 2 64bit microops.
Newer archs have 128bit sse units.
  • AMD K8: 2 64bit units
  • intel Core/Core2: 3 64bit units
  • AMD K10: 2 128bit units
  • intel nehalem: 3 128bit units
K10 = Opterons with 4 or more cores, Phenom, Phenom II, Athlon II
nehalem = xeon 34xx/35xx/36xx/55xx/56xx/65xx/75xx, i3/i5/i7

bitcoin: 1Fb77Xq5ePFER8GtKRn2KDbDTVpJKfKmpz
i0coin: jNdvyvd6v6gV3kVJLD7HsB5ZwHyHwAkfdw
satoshi (OP)
August 22, 2010, 11:21:50 PM
 #47

Thanks for clearing that up.  I read the link someone posted about AMD making that change around 2007, but I didn't know what the story was for Intel.

There's no hope for Core/Core2 then.  They only have half the SSE2 hardware.

Strange that Intel has 3 128bit units, but AMD with 2 128bit units is the faster one.
Ground Loop
August 23, 2010, 05:45:17 AM
 #48

Intel Atom 230 @ 1.60GHz.  Linux 32-bit.
(Acer Aspire Revo)

Stock: 438 khash/sec (1 proc gives 354)
4way:  254 khash/sec

So you can take this one off the powerhouse list. :)

Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq
sgtstein
August 24, 2010, 05:31:08 PM
 #49

Anybody catch the new AMD Bulldozer press release? If I understand correctly, it should be capable of processing 8 64bit hashes, per core, at the same time. Would be quite a speed boost using this same code design.

Slashdot has the article.
PC Perspective has the details.

Was also covered by AnandTech back in November, 2009.
satoshi (OP)
August 24, 2010, 10:43:56 PM
 #50

  • AMD K10: 2 128bit units
  • intel nehalem: 3 128bit units
This probably explains why hyperthreading increases performance with -4way.  If three SSE2 units is excessive, then hyperthreading would help keep them all busy.
tcatm
August 28, 2010, 12:27:08 AM
 #51

I just reviewed the source code, as I had a few ideas to optimize it further, and I noticed that 4way is partly broken:

from main.cpp:
Code:
                for (int j = 0; j < NPAR; j++)
                {   
                    if (thash[7][j] == 0)
                    {                       
                        for (int i = 0; i < sizeof(hash)/4; i++)
                          ((unsigned int*)&hash)[i] = thash[i][j];
                        pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
                    }   
                }   

The code will only process one hash (the last with thash[7] == 0) out of 32 hashes even when there is more than one hash that might be a correct one.

Something like this should fix it but it won't be safe at higher difficulties. Also, I'm not sure whether the byte order should be reversed or not. Could someone review this?
Code:
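                // Start at the largest 32-bit value (~0U, not ~1) so that any
                // candidate can replace it.
                unsigned int min_hash = ~0U;
                for (int j = 0; j < NPAR; j++)
                {
                    if (thash[7][j] == 0)
                    {
                        // Keep the candidate with the smallest next word.
                        if (thash[6][j] <= min_hash)
                        {
                            min_hash = thash[6][j];
                            for (int i = 0; i < sizeof(hash)/4; i++)
                                ((unsigned int*)&hash)[i] = thash[i][j];
                            pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
                        }
                    }
                }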
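
For anyone reviewing, here is a minimal standalone sketch of the difference between the two selection loops above. It is not the code from main.cpp: `Batch`, `empty_batch`, `pick_last` and `pick_lowest` are hypothetical names made up for illustration, the table is toy data, and the byte-order question raised above is ignored entirely.

```cpp
#include <cassert>   // for the checks in the usage note
#include <cstdint>

const int NPAR = 32;  // hashes per batch, as in the thread

// Toy stand-in (assumption) for the 8 x NPAR word layout of thash.
struct Batch { uint32_t thash[8][NPAR]; };

// Builds a batch in which no lane qualifies (top word is nonzero everywhere).
Batch empty_batch()
{
    Batch b;
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < NPAR; j++)
            b.thash[i][j] = (i == 7) ? 1u : 0u;
    return b;
}

// Shipped behaviour: every qualifying lane overwrites the previous one,
// so the last lane with thash[7][j] == 0 wins. Returns -1 if none qualifies.
int pick_last(const Batch& b)
{
    int winner = -1;
    for (int j = 0; j < NPAR; j++)
        if (b.thash[7][j] == 0)
            winner = j;
    return winner;
}

// Proposed behaviour: among qualifying lanes, keep the smallest thash[6][j].
int pick_lowest(const Batch& b)
{
    int winner = -1;
    uint32_t min_hash = ~0u;
    for (int j = 0; j < NPAR; j++)
        if (b.thash[7][j] == 0 && b.thash[6][j] <= min_hash)
        {
            min_hash = b.thash[6][j];
            winner = j;
        }
    return winner;
}
```

With two qualifying lanes, say lane 3 with thash[6] = 5 and lane 20 with thash[6] = 9, `pick_last` returns 20 while `pick_lowest` returns 3. With zero or one qualifying lane the two agree.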
satoshi (OP)
August 28, 2010, 02:27:15 PM
 #52

The simplification is intentional.  There will only be more than one thash[7]=0 in one out of 134,217,728 cases.  It only makes it 0.0000007% slower.
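
One way to arrive at those figures, as a rough sketch: assume the top word of each hash is uniformly random, so a given lane has thash[7] == 0 with probability 1 in 2^32, and scale by the 32 lanes in a batch. That assumption and the function names `one_in` and `percent` are illustrative, not taken from the source. A stricter conditional calculation would count only the 31 other lanes, which changes the result slightly.

```cpp
#include <cassert>   // for the checks in the usage note
#include <cstdint>

// "One in how many": 2^32 possible values of the top word,
// divided by the number of lanes hashed per batch.
uint64_t one_in(unsigned npar)
{
    assert(npar > 0);
    return (UINT64_C(1) << 32) / npar;
}

// The same chance expressed as a percentage.
double percent(unsigned npar)
{
    return 100.0 * npar / 4294967296.0;  // 4294967296 = 2^32
}
```

For 32 lanes, `one_in(32)` gives 134,217,728 (2^27), and `percent(32)` is about 0.00000075, in line with the numbers quoted above.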
Gespenster
August 29, 2010, 11:11:32 AM
 #53

@sgtstein: Intel's Sandy Bridge (to be released Q4 2010) will also support AVX 256-bit SIMD registers. That means 8 simultaneous hash calculations/thread would be possible, in principle.

Does anybody have any reports on 4-way SSE2 on the Pentium D (Presler)? What kind of performance can I expect? I have an old Pentium D + mobo lying around and I would fire it up as a mining server if the performance is OK. It probably won't be the most efficient in khash/Watt, though.
lfm
August 29, 2010, 04:55:41 PM
 #54

Does anybody have any reports on 4-way SSE2 on the Pentium D (Presler)? What kind of performance can I expect? I have an old Pentium D + mobo lying around and I would fire it up as a mining server if the performance is OK. It probably won't be the most efficient in khash/Watt, though.

My Pentium D died, but it was essentially just two P4s in one package and would probably handle bitcoin like one. Yes, it was terribly power hungry. The 4way code doesn't do very well on P4s in general: I only get about 900 khash/s on a 3.4 GHz P4 without -4way, and with -4way it's in the 600s.

Ground Loop
August 31, 2010, 12:48:33 AM
 #55

Seriously?  Got free electricity?

At Difficulty 623, I've shut down anything under 3000 khash/sec.

Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq
lfm
September 01, 2010, 05:23:26 PM
 #56

Seriously?  Got free electricity?

At Difficulty 623, I've shut down anything under 3000 khash/sec.


I don't have free electricity but I am running a number of electric heaters that look like computers.

The bits produced are a by-product.
Ground Loop
September 16, 2010, 07:08:19 PM
 #57

4way pays off on one of the HP blade machines..
It's a 12-core Intel(R) Xeon(R) CPU X5650  @ 2.67GHz

Running 24 threads with 4way, I get 22,569 khash/sec.

Yow.

Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq
LZ
September 16, 2010, 08:20:19 PM
 #58

Cool! This means that the processor in practice can catch up with the video card!

My OpenPGP fingerprint: 5099EB8C0F2E68C63B4ECBB9A9D0993E04143362
Ground Loop
September 16, 2010, 09:24:30 PM
 #59

It's still not cost effective.
These are HP BL460c blades.. around $6k each.  That buys a lot of fresh CUDA!

It's a fun way to do "burn in", but not a smart use of resources.
12 hyperthreading Xeon cores, though..  each.

22,500 khash/sec with -4way, and only 13,400 without, so yeah, it's not subtle.

Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq
BeeCee1
January 21, 2011, 02:11:10 AM
 #60

I made a couple of changes to the sse2 source that sped it up about 5 or 6%:

I changed:
#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3)
to
#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(x0, x1),_mm_add_epi32( x2,x3))

It is just reordering the adds.  In the original there is a data dependency: each add depends on the result of the one before.  The way I reordered it, two of the adds are independent.  This function is called many times, so that little change can add up. (On an older machine it made no difference, so YMMV.)
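
A small portable sketch of why the reordering is safe, under stated assumptions: `Vec4` is a hypothetical scalar stand-in for `__m128i`, and `add_epi32` only models the lane-wise wrapping 32-bit add of `_mm_add_epi32`. It says nothing about speed. It only shows that the chained and the paired form produce the same lanes, because 32-bit addition modulo 2^32 is associative.

```cpp
#include <cassert>   // for the checks in the usage note
#include <cstdint>

// Hypothetical stand-in for __m128i: four independent 32-bit lanes.
struct Vec4 { uint32_t lane[4]; };

// Models a lane-wise 32-bit add; unsigned arithmetic wraps modulo 2^32.
Vec4 add_epi32(const Vec4& a, const Vec4& b)
{
    Vec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

// Original macro shape: three adds, each waiting on the previous result.
Vec4 add4_chain(const Vec4& x0, const Vec4& x1, const Vec4& x2, const Vec4& x3)
{
    return add_epi32(add_epi32(add_epi32(x0, x1), x2), x3);
}

// Reordered shape: (x0 + x1) and (x2 + x3) do not depend on each other.
Vec4 add4_tree(const Vec4& x0, const Vec4& x1, const Vec4& x2, const Vec4& x3)
{
    return add_epi32(add_epi32(x0, x1), add_epi32(x2, x3));
}

// True when all four lanes match.
bool same(const Vec4& a, const Vec4& b)
{
    for (int i = 0; i < 4; i++)
        if (a.lane[i] != b.lane[i])
            return false;
    return true;
}
```

Feeding both versions the same four vectors, including lanes that overflow, should make `same(add4_chain(a, b, c, d), add4_tree(a, b, c, d))` return true.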

A portion of the nonce calculation is repeated over and over, even though the result is the same.  I moved
nonce = _mm_set1_epi32(In[3]);
nonce = _mm_add_epi32(nonce, offset);
out of the "for(k = 0; k<NPAR; k+=4) {"

Here's a diff
153c153
>     __m128i nonce,preNonce;
---
<     __m128i nonce;
157d156
>     preNonce = _mm_add_epi32(_mm_set1_epi32(In[3]),offset);
179,182c178,180
>         //nonce = _mm_set1_epi32(In[3]);
>         //nonce = _mm_add_epi32(nonce, offset);
>         //nonce = _mm_add_epi32(nonce, _mm_set1_epi32(k));
>         nonce = _mm_add_epi32(preNonce,_mm_set1_epi32(k));
---
<         nonce = _mm_set1_epi32(In[3]);
<         nonce = _mm_add_epi32(nonce, offset);
<         nonce = _mm_add_epi32(nonce, _mm_set1_epi32(k));
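
The nonce change is ordinary loop-invariant hoisting. Below is a scalar sketch of the idea for a single lane, with hypothetical names (`nonces_recomputed`, `nonces_hoisted`, `pre_nonce`) and plain `uint32_t` values standing in for the SSE2 vectors. It only demonstrates that both loop shapes yield the same nonces.

```cpp
#include <cassert>   // for the checks in the usage note
#include <cstdint>
#include <vector>

// Original shape: base + offset is recomputed on every pass through the loop.
std::vector<uint32_t> nonces_recomputed(uint32_t base, uint32_t offset, unsigned npar)
{
    std::vector<uint32_t> out;
    for (unsigned k = 0; k < npar; k += 4)
    {
        uint32_t nonce = base;
        nonce = nonce + offset;
        nonce = nonce + k;
        out.push_back(nonce);
    }
    return out;
}

// Hoisted shape: the part that never changes is computed once, before the loop.
std::vector<uint32_t> nonces_hoisted(uint32_t base, uint32_t offset, unsigned npar)
{
    const uint32_t pre_nonce = base + offset;
    std::vector<uint32_t> out;
    for (unsigned k = 0; k < npar; k += 4)
        out.push_back(pre_nonce + k);
    return out;
}
```

For example, with base 100, offset 3 and 32 hashes per batch, both functions return the same eight values, starting 103, 107, 111.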


I have been running this for a couple of days on the mining pool and have generated shares.