tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

nelisky

Legendary

Offline

Activity: 1540
Merit: 1005

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

August 18, 2010, 11:02:25 PM

#41

Quote from: Ground Loop on August 18, 2010, 11:00:08 PM

So is it accurate to say that, so far, only Intel Core i7 processors and certain (Phenom?) AMD processors enjoy a speed bump from -4way?

And i5, at least on my macbookpro

Ground Loop

Member

Offline

Activity: 111
Merit: 10

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

August 18, 2010, 11:14:26 PM
Last edit: August 22, 2010, 03:00:00 AM by Ground Loop

#42

Any non-Mac i5 love?
Windows i5 64-bit got slower here.
[correction -- not true. Windows doesn't have -4way, and the Linux machines are Xeons.]

Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq

vess

Full Member

Offline

Activity: 141
Merit: 100

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

August 19, 2010, 06:41:27 AM

#43

My Core i5 laptop (Ubuntu) doubled in speed. Actually, it didn't double in speed. It stayed the same speed, but only uses half the CPU now. I can't get it to go back to full CPU usage. That said, my laptop is a lot cooler when generating blocks now. I'll post back if I see it successfully go up to 100% usage.

I'm the CEO of CoinLab (www.coinlab.com) and the Executive Director of the Bitcoin Foundation, I will identify if I'm speaking for myself or one of the organizations when I post from this account.

satoshi (OP)

Founder
Sr. Member

Offline

Activity: 364
Merit: 8590

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

August 19, 2010, 07:07:43 PM

#44

Quote from: Ground Loop on August 18, 2010, 11:14:26 PM

Any non-Mac i5 love?
Windows i5 64-bit got slower here.

That's the first I've heard anyone say i5 was slower. Everyone else has said 4way was faster on i5. More so with hyperthreading enabled.

Quote from: nelisky on August 18, 2010, 11:02:25 PM

And i5, at least on my macbookpro

Good, so I take it that's a confirmation that it's working on Mac as well?

Laszlo told me he did compile in the -4way stuff on Mac, so the -4way switch is also available to try on Mac. I don't think makefile.osx on SVN has it yet, just the built version.

nelisky

Legendary

Offline

Activity: 1540
Merit: 1005

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

August 19, 2010, 08:28:19 PM

#45

Quote from: satoshi on August 19, 2010, 07:07:43 PM

Quote from: nelisky on August 18, 2010, 11:02:25 PM

And i5, at least on my macbookpro

Yep, it's working all right. The numbers I had posted were from an old svn revision patched with tcatm's changes, but today I compiled trunk. I once again had to tweak the makefile, but after that it works great, with the numbers matching what I experienced before.

The changes I made for my system are below. Some are cosmetic, like removing wx-config from the bitcoind build just to avoid the warnings if you don't have it installed. Others are system specific, like the DEPS dir and the fact that I don't have 32-bit libs, which makes the link step fail if -arch i386 is there.
The bsddb changes are, I believe, a typo. Includes and Libs point to db46, but then the object list for the linker states db48. Anyway, here's the diff for what got me going:

Code:

Index: makefile.osx
===================================================================
--- makefile.osx	(revision 139)
+++ makefile.osx	(working copy)
@@ -6,29 +6,29 @@
 # Laszlo Hanyecz (solar@heliacal.net)
 
 CXX=llvm-g++
-DEPSDIR=/Users/macosuser/bitcoin/deps
+DEPSDIR=/opt/local
 
 INCLUDEPATHS= \
- -I"$(DEPSDIR)/include"
+ -I"$(DEPSDIR)/include"  -I"$(DEPSDIR)/include/db46"
 
 LIBPATHS= \
- -L"$(DEPSDIR)/lib"
+ -L"$(DEPSDIR)/lib"  -L"$(DEPSDIR)/lib/db46"
 
-WXLIBS=$(shell $(DEPSDIR)/bin/wx-config --libs --static)
+WXLIBS=
 
 LIBS= -dead_strip \
- $(DEPSDIR)/lib/libdb_cxx-4.8.a \
- $(DEPSDIR)/lib/libboost_system.a \
- $(DEPSDIR)/lib/libboost_filesystem.a \
- $(DEPSDIR)/lib/libboost_program_options.a \
- $(DEPSDIR)/lib/libboost_thread.a \
+ $(DEPSDIR)/lib/db46/libdb_cxx-4.6.a \
+ $(DEPSDIR)/lib/libboost_system-mt.a \
+ $(DEPSDIR)/lib/libboost_filesystem-mt.a \
+ $(DEPSDIR)/lib/libboost_program_options-mt.a \
+ $(DEPSDIR)/lib/libboost_thread-mt.a \
  $(DEPSDIR)/lib/libcrypto.a 
 
-DEFS=$(shell $(DEPSDIR)/bin/wx-config --cxxflags) -D__WXMAC_OSX__ -DNOPCH -DMSG_NOSIGNAL=0
+DEFS=-D__WXMAC_OSX__ -DNOPCH -DMSG_NOSIGNAL=0 -DFOURWAYSSE2
 
 DEBUGFLAGS=-g -DwxDEBUG_LEVEL=0
 # ppc doesn't work because we don't support big-endian
-CFLAGS=-mmacosx-version-min=10.5 -arch i386 -arch x86_64 -O3 -Wno-invalid-offsetof -Wformat $(DEBUGFLAGS) $(DEFS) $(INCLUDEPATHS)
+CFLAGS=-mmacosx-version-min=10.5 -arch x86_64 -O3 -Wno-invalid-offsetof -Wformat $(DEBUGFLAGS) $(DEFS) $(INCLUDEPATHS)
 HEADERS=headers.h strlcpy.h serialize.h uint256.h util.h key.h bignum.h base58.h \
     script.h db.h net.h irc.h main.h rpc.h uibase.h ui.h noui.h init.h
 
@@ -42,6 +42,7 @@
     obj/rpc.o \
     obj/init.o \
     cryptopp/obj/sha.o \
+    obj/sha256.o \
     cryptopp/obj/cpu.o
 	
 
@@ -55,7 +56,7 @@
 	$(CXX) -c $(CFLAGS) -O3 -DCRYPTOPP_DISABLE_ASM -o $@ $<
 
 bitcoin: $(OBJS) obj/ui.o obj/uibase.o
-	$(CXX) $(CFLAGS) -o $@ $(LIBPATHS) $^ $(WXLIBS) $(LIBS)
+	$(CXX) $(shell $(DEPSDIR)/bin/wx-config --cxxflags) $(CFLAGS) -o $@ $(LIBPATHS) $^ $(shell $(DEPSDIR)/bin/wx-config --libs --static) $(LIBS)
 
 
 obj/nogui/%.o: %.cpp $(HEADERS)

ArtForz

Sr. Member

Offline

Activity: 406
Merit: 257

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

August 21, 2010, 04:56:31 PM

#46

The difference between new and older CPUs is pretty easy to explain.
Older microarchitectures have 64-bit mmx/sse execution units and split 128bit sse ops into 2 64bit microops.
Newer archs have 128bit sse units.

AMD K8: 2 64bit units
intel Core/Core2: 3 64bit units
AMD K10: 2 128bit units
intel nehalem: 3 128bit units

K10 = Opterons with 4 or more cores, Phenom, Phenom II, Athlon II
nehalem = xeon 34xx/35xx/36xx/55xx/56xx/65xx/75xx, i3/i5/i7

bitcoin: 1Fb77Xq5ePFER8GtKRn2KDbDTVpJKfKmpz
i0coin: jNdvyvd6v6gV3kVJLD7HsB5ZwHyHwAkfdw

satoshi (OP)

Founder
Sr. Member

Offline

Activity: 364
Merit: 8590

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

August 22, 2010, 11:21:50 PM

#47

Thanks for clearing that up. I read the link someone posted about AMD making that change around 2007, but I didn't know what the story was for Intel.

There's no hope for Core/Core2 then. They only have half the SSE2 hardware.

Strange that Intel has 3 128bit units, but AMD with 2 128bit units is the faster one.

Ground Loop

Member

Offline

Activity: 111
Merit: 10

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

August 23, 2010, 05:45:17 AM

#48

Intel Atom 230 @ 1.60GHz. Linux 32-bit.
(Acer Aspire Revo)

Stock: 438 khash/sec (1 proc gives 354)
4way: 254 khash/sec

So you can take this one off the powerhouse list.

Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq

sgtstein

Member

Offline

Activity: 61
Merit: 10

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

August 24, 2010, 05:31:08 PM

#49

Anybody catch the new AMD Bulldozer press release? If I understand correctly, it should be capable of processing 8 64bit hashes, per core, at the same time. Would be quite a speed boost using this same code design.

Slashdot has the article.
PC Perspective has the details.

Was also covered by AnandTech back in November, 2009.

satoshi (OP)

Founder
Sr. Member

Offline

Activity: 364
Merit: 8590

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

August 24, 2010, 10:43:56 PM

#50

Quote from: ArtForz on August 21, 2010, 04:56:31 PM

AMD K10: 2 128bit units
intel nehalem: 3 128bit units

This probably explains why hyperthreading increases performance with -4way. If three SSE2 units is excessive, then hyperthreading would help keep them all busy.

tcatm

Sr. Member

Offline

Activity: 337
Merit: 285

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

August 28, 2010, 12:27:08 AM

#51

I just reviewed the sourcecode as I had a few ideas to optimize it further and I noticed that 4way is partly broken:

from main.cpp:

Code:

                for (int j = 0; j < NPAR; j++) 
                {    
                    if (thash[7][j] == 0)
                    {                        
                        for (int i = 0; i < sizeof(hash)/4; i++) 
                          ((unsigned int*)&hash)[i] = thash[i][j];
                        pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
                    }    
                }

The code will only process one hash (the last with thash[7] == 0) out of 32 hashes even when there is more than one hash that might be a correct one.

Something like this should fix it but it won't be safe at higher difficulties. Also, I'm not sure whether the byte order should be reversed or not. Could someone review this?

Code:

                unsigned int min_hash = ~0U;  // all bits set; ~1 would be 0xFFFFFFFE
                for (int j = 0; j < NPAR; j++)
                {    
                    if (thash[7][j] == 0)
                    {    
                        if(thash[6][j] < min_hash) {
                          min_hash = thash[6][j];
                          for (int i = 0; i < sizeof(hash)/4; i++) 
                            ((unsigned int*)&hash)[i] = thash[i][j];
                          pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
                        }    
                    }    
                }
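
To make the difference between the two loops concrete, here is a minimal standalone sketch. It is not the main.cpp code: `Batch`, `pick_last` and `pick_min` are hypothetical names, and NPAR = 32 is assumed from the "32 hashes" mentioned above. It only models which lane gets selected, not the hash copy or the nonce byte order.

```cpp
const int NPAR = 32;  // assumed batch size, taken from the "32 hashes" above

// Hypothetical container for one batch: thash[word][lane].
struct Batch {
    unsigned int thash[8][NPAR];
};

// Models the current loop: every lane with thash[7] == 0 overwrites the
// previous pick, so the last such lane wins. Returns -1 if there is none.
int pick_last(const Batch& b) {
    int found = -1;
    for (int j = 0; j < NPAR; j++) {
        if (b.thash[7][j] == 0)
            found = j;
    }
    return found;
}

// Models the proposed loop: among lanes with thash[7] == 0, keep the one
// with the smallest thash[6]. Returns -1 if there is none.
int pick_min(const Batch& b) {
    int found = -1;
    unsigned int min_hash = ~0U;
    for (int j = 0; j < NPAR; j++) {
        if (b.thash[7][j] != 0)
            continue;
        // The found < 0 test lets the first candidate through even when
        // its thash[6] is already 0xFFFFFFFF.
        if (found < 0 || b.thash[6][j] < min_hash) {
            min_hash = b.thash[6][j];
            found = j;
        }
    }
    return found;
}
```

With two candidate lanes in one batch, `pick_last` returns the higher lane index regardless of the hash values, while `pick_min` returns the lane whose thash[6] is lower.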

satoshi (OP)

Founder
Sr. Member

Offline

Activity: 364
Merit: 8590

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

August 28, 2010, 02:27:15 PM

#52

The simplification is intentional. There will only be more than one thash[7]=0 in one out of 134,217,728 cases. It only makes it 0.0000007% slower.
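
One way to arrive at those figures, as a rough sketch: assume the top 32-bit word of each hash is uniformly random and that a batch holds 32 hashes (the number tcatm quotes). The function names here are made up for illustration.

```cpp
#include <cstdint>

// Assumption: thash[7] is uniformly random, so it is zero once in 2^32 hashes.
const std::uint64_t kWordValues = 1ULL << 32;

// Assumption: 32 hashes per batch, the figure quoted earlier in the thread.
const std::uint64_t kBatchSize = 32;

// Roughly one winning batch in this many also holds a second zero word.
std::uint64_t second_hit_one_in() {
    return kWordValues / kBatchSize;
}

// The same ratio expressed as a percentage.
double second_hit_percent() {
    return 100.0 / static_cast<double>(second_hit_one_in());
}
```

Here `second_hit_one_in()` is 2^32 / 32 = 134,217,728, and `second_hit_percent()` is about 0.00000075. Counting only the 31 other lanes gives a marginally smaller value that still rounds to the 0.0000007% quoted.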

Gespenster

Newbie

Offline

Activity: 15
Merit: 0

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

August 29, 2010, 11:11:32 AM

#53

@sgtstein: Intel's Sandy Bridge (to be released Q4 2010) will also support AVX 256-bit SIMD registers. That means 8 simultaneous hash calculations/thread would be possible, in principle.

Does anybody have any reports on 4-way SSE2 on the Pentium D (Presler)? What kind of performance can I expect? I have an old Pentium D + mobo lying around and I would fire it up as a mining server if performance is OK. Probably won't be the most efficient khash/Watt though.

lfm

Full Member

Offline

Activity: 196
Merit: 104

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

August 29, 2010, 04:55:41 PM

#54

Quote from: Gespenster on August 29, 2010, 11:11:32 AM

Does anybody have any reports on 4-way SSE2 on the Pentium D (Presler)? What kind of performance can I expect? I have an old Pentium D + mobo lying around and I would fire it up as a mining server if performance is OK. Probably won't be the most efficient khash/Watt though.

My Pentium-D died, but it was generally just two P4s in one package and will probably do bitcoin like that. Yes, it was terribly power hungry. The 4way code doesn't do very well on P4s in general. I only get about 900 khash/s on a 3.4 GHz P4 without -4way. With -4way it's in the 600s.

Ground Loop

Member

Offline

Activity: 111
Merit: 10

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

August 31, 2010, 12:48:33 AM

#55

Seriously? Got free electricity?

At Difficulty 623, I've shut down anything under 3000 khash/sec.

Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq

lfm

Full Member

Offline

Activity: 196
Merit: 104

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

September 01, 2010, 05:23:26 PM

#56

Quote from: Ground Loop on August 31, 2010, 12:48:33 AM

Seriously? Got free electricity?

At Difficulty 623, I've shut down anything under 3000 khash/sec.

I don't have free electricity but I am running a number of electric heaters that look like computers.

The bits produced are a by-product.

Ground Loop

Member

Offline

Activity: 111
Merit: 10

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

September 16, 2010, 07:08:19 PM

#57

4way pays off on one of the HP blade machines..
It's a 12-core Intel(R) Xeon(R) CPU X5650 @ 2.67GHz

Running 24 threads with 4way, I get 22,569 khash/sec.

Yow.

Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq

Legendary

Offline

Activity: 1722
Merit: 1072

P2P Cryptocurrency

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

September 16, 2010, 08:20:19 PM

#58

Cool! This means that the processor in practice can catch up with the video card!

My OpenPGP fingerprint: 5099EB8C0F2E68C63B4ECBB9A9D0993E04143362

Ground Loop

Member

Offline

Activity: 111
Merit: 10

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

September 16, 2010, 09:24:30 PM

#59

It's still not cost effective.
These are HP BL460c blades.. around $6k each. That buys a lot of fresh CUDA!

It's a fun way to do "burn in", but not a smart use of resources.
12 hyperthreading Xeon cores, though.. each.

22,500 khash/sec with -4way, and only 13,400 without, so yeah, it's not subtle.

Bitcoin accepted here: 1HrAmQk9EuH3Ak6ugsw3qi3g23DG6YUNPq

BeeCee1

Member

Offline

Activity: 115
Merit: 10

Re: tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

January 21, 2011, 02:11:10 AM

#60

I made a couple of changes to the sse2 source that sped it up about 5 or 6%:

I changed:
#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(_mm_add_epi32(x0, x1), x2), x3)
to
#define add4(x0, x1, x2, x3) _mm_add_epi32(_mm_add_epi32(x0, x1),_mm_add_epi32( x2,x3))

It is just re-ordering the adds. In the original there is a data dependency: each add depends on the result of the one before. The way I reordered it, two of the adds are independent. This function is called a lot of times, so that little change can add up. (On an older machine it made no difference, so YMMV.)
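
A portable sketch of why the two orderings give the same answer. This is a scalar model, not the SSE2 source: `Lanes` and `add_lanes` are hypothetical stand-ins for `__m128i` and `_mm_add_epi32`, which adds four 32-bit lanes with wraparound. Wraparound addition is associative, so regrouping changes only the dependency chain, not the result.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for __m128i holding four 32-bit lanes.
typedef std::array<std::uint32_t, 4> Lanes;

// Scalar model of _mm_add_epi32: four independent adds that wrap on overflow.
Lanes add_lanes(const Lanes& a, const Lanes& b) {
    Lanes r;
    for (int i = 0; i < 4; i++)
        r[i] = a[i] + b[i];
    return r;
}

// Original order: ((x0 + x1) + x2) + x3, a chain of three dependent adds.
Lanes add4_chain(const Lanes& x0, const Lanes& x1, const Lanes& x2, const Lanes& x3) {
    return add_lanes(add_lanes(add_lanes(x0, x1), x2), x3);
}

// Reordered: (x0 + x1) + (x2 + x3). The two inner adds do not depend on
// each other, so the longest chain is two adds instead of three.
Lanes add4_tree(const Lanes& x0, const Lanes& x1, const Lanes& x2, const Lanes& x3) {
    return add_lanes(add_lanes(x0, x1), add_lanes(x2, x3));
}
```

Whether the shorter chain actually runs faster depends on the CPU and on what the compiler already does with the expression, which fits the observation that an older machine showed no difference.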

A portion of the nonce calculation is repeated over and over, even though the result is the same. I moved
nonce = _mm_set1_epi32(In[3]);
nonce = _mm_add_epi32(nonce, offset);
out of the "for(k = 0; k<NPAR; k+=4) {"

Here's a diff
153c153
> __m128i nonce,preNonce;
---
< __m128i nonce;
157d156
> preNonce = _mm_add_epi32(_mm_set1_epi32(In[3]),offset);
179,182c178,180
> //nonce = _mm_set1_epi32(In[3]);
> //nonce = _mm_add_epi32(nonce, offset);
> //nonce = _mm_add_epi32(nonce, _mm_set1_epi32(k));
> nonce = _mm_add_epi32(preNonce,_mm_set1_epi32(k));
---
< nonce = _mm_set1_epi32(In[3]);
< nonce = _mm_add_epi32(nonce, offset);
< nonce = _mm_add_epi32(nonce, _mm_set1_epi32(k));

I have been running this for a couple of days on the mining pool and have generated shares.
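
The nonce change is ordinary loop-invariant hoisting. A single-lane sketch, assuming the per-iteration nonce is just base + offset + k with 32-bit wraparound (the function names and the `std::vector` return are only for illustration, not part of the sse2 source):

```cpp
#include <cstdint>
#include <vector>

// Before: base + offset is recomputed on every pass through the loop.
std::vector<std::uint32_t> nonces_naive(std::uint32_t base, std::uint32_t offset, unsigned npar) {
    std::vector<std::uint32_t> out;
    for (unsigned k = 0; k < npar; k += 4) {
        std::uint32_t nonce = base;
        nonce += offset;
        nonce += k;
        out.push_back(nonce);
    }
    return out;
}

// After: the invariant part is computed once, leaving one add per pass.
std::vector<std::uint32_t> nonces_hoisted(std::uint32_t base, std::uint32_t offset, unsigned npar) {
    std::vector<std::uint32_t> out;
    const std::uint32_t pre_nonce = base + offset;  // plays the role of preNonce
    for (unsigned k = 0; k < npar; k += 4)
        out.push_back(pre_nonce + k);
    return out;
}
```

Both functions return the same sequence for any inputs, including values that wrap past 2^32, so the hoisted form changes the work per iteration but not the nonces tried.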
