Author Topic: Xeon Phi  (Read 36477 times)
DiabloD3
Legendary
*
Offline Offline

Activity: 1162
Merit: 1000


DiabloMiner author


View Profile WWW
June 29, 2012, 04:59:50 AM
 #101

I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.
I think the confusion runs deeper that just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's  document number: 327357-001

Quote
2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Yeah, but I suspect that's the Radeon trick: have a pipeline four issues deep, so memory latency is effectively hidden. It probably can't switch on demand.

AzN1337c0d3r
Full Member
***
Offline Offline

Activity: 238
Merit: 100



View Profile
June 29, 2012, 05:39:48 AM
 #102

I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.
I think the confusion runs deeper that just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's  document number: 327357-001

Quote
2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Yeah, but I suspect thats the Radeon trick: have a pipeline 4 issue deep, so memory latency is effectively hid. It probably cant switch on demand.

I remember reading somewhere that MIC was supposed to be barrel-threaded (i.e., fine-grained multithreading), somewhat akin to the UltraSPARC T1. I can't find the source now, though.

DiabloD3
Legendary
*
Offline Offline

Activity: 1162
Merit: 1000


DiabloMiner author


View Profile WWW
June 29, 2012, 05:57:05 AM
 #103

I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.
I think the confusion runs deeper that just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's  document number: 327357-001

Quote
2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Yeah, but I suspect thats the Radeon trick: have a pipeline 4 issue deep, so memory latency is effectively hid. It probably cant switch on demand.

I remember reading somewhere that MIC was supposed to be barrel-threaded (ie. fine-grained multithreading), somewhat akin to Ultrasparc T1. I can't find the source now though.

That's almost the same trick. UltraSPARC T[1-4]s and newer IBM POWERs have multiple thread decoders, and switch to the next on anything that would block execution. AMD Bulldozers do the same, but have a semi-unified scheduler that schedules the next instruction (from one of two already-decoded streams) onto the next ALU (2 threads -> 4 integer ALUs and 2 FP ALUs).

Radeons, however, don't switch on block. They automatically assume there is memory latency and that results won't be available until four pipeline executions later. It makes the hardware simpler and easier to design compilers for. I suspect Knights Corner is more like a Radeon than a Niagara in this case.

mrb
Legendary
*
Offline Offline

Activity: 1512
Merit: 1028


View Profile WWW
June 29, 2012, 06:25:11 AM
 #104

I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.
I think the confusion runs deeper that just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's  document number: 327357-001

Quote
2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Xeon Phi is hyperthreaded (the vendor-neutral term for this is SMT = simultaneous multithreading), but as I am sure you know, SMT does not increase the performance of ALU-bound workloads at all. Therefore we can ignore SMT when making theoretical estimates of bitcoin mining performance.
DiabloD3
Legendary
*
Offline Offline

Activity: 1162
Merit: 1000


DiabloMiner author


View Profile WWW
June 29, 2012, 06:29:12 AM
 #105

I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.
I think the confusion runs deeper that just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's  document number: 327357-001

Quote
2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Xeon Phi is hyperthreaded (the vendor-neutral term for this is SMT = simultaneous multithreading), but as I am sure you know, SMT does not increase the performance of ALU-bound workloads at all. Therefore we can ignore SMT when making theoretical estimates of bitcoin mining performance.


It's only hyperthreaded? That's kinda pointless altogether.

mrb
Legendary
*
Offline Offline

Activity: 1512
Merit: 1028


View Profile WWW
June 29, 2012, 06:54:06 AM
 #106

Its only hyperthreaded? Thats kinda pointless altogether.

I use this term liberally. I have no idea what type of threading Xeon Phi will implement (the GPU way: switching unconditionally to the next thread on each instruction; or the CPU way: switching to the next thread when the current one would wait on memory).

But either way, it does not matter to us. Mining is an embarrassingly parallel workload, so an implementation can be adjusted to fully exploit the ALU resources of Xeon Phi, and whatever type of threading Xeon Phi implements will not add any extra performance.
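
To illustrate what "embarrassingly parallel" means here, the following is a minimal sketch of splitting a nonce range across independent workers. It is not taken from any miner in this thread: the helper names (`sha256d`, `nonce_ranges`, `scan`), the all-zero 76-byte header prefix, the worker count, and the very easy target are assumptions chosen for illustration. The point is only that no worker needs anything from any other worker.

```python
import hashlib
import struct

def sha256d(data: bytes) -> bytes:
    """Double SHA-256, the hash applied to Bitcoin block headers."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def nonce_ranges(total: int, workers: int):
    """Split [0, total) into contiguous, non-overlapping ranges, one per worker."""
    step, extra = divmod(total, workers)
    start = 0
    for w in range(workers):
        end = start + step + (1 if w < extra else 0)
        yield range(start, end)
        start = end

def scan(header76: bytes, nonces: range, target: int):
    """Return the nonces in this range whose hash is at or below the target."""
    hits = []
    for nonce in nonces:
        digest = sha256d(header76 + struct.pack("<I", nonce))
        if int.from_bytes(digest, "little") <= target:
            hits.append(nonce)
    return hits

# Hypothetical inputs, for illustration only.
header76 = bytes(76)        # placeholder for the first 76 header bytes
target = 1 << 248           # very easy target: about 1 hash in 256 qualifies
workers = 200               # arbitrary worker count (assumption)

ranges = list(nonce_ranges(4096, workers))
# Each range could be handed to a separate hardware thread; here they run in turn.
hits = [n for r in ranges for n in scan(header76, r, target)]
print(len(ranges), "ranges,", len(hits), "qualifying nonces")
```

Because every range is scanned independently, the combined result is the same as one sequential scan over the whole range, which is why the threading style of the hardware does not change what the program has to do.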
2112
Legendary
*
Offline Offline

Activity: 2128
Merit: 1073



View Profile
June 29, 2012, 08:06:03 PM
 #107

Its only hyperthreaded? Thats kinda pointless altogether.
It is both superscalar (2-way) and hyperthreaded (4-way).

Another quote from the same manual:
Quote
0x00 0x16 INSTRUCTIONS_EXECUTED Number of instructions executed (up to two per clock)
0x00 0x17 INSTRUCTIONS_EXECUTED_V_PIPE Number of instructions executed in the V_pipe. The event indicates the number of instructions that were paired.
0x20 0x16 VPU_INSTRUCTIONS_EXECUTED Counts the number of VPU instructions executed in both u- and v-pipes.
0x20 0x17 VPU_INSTRUCTIONS_EXECUTED_V_PIPE Counts the number of VPU instructions that paired and executed in the v-pipe.

As mrb said:
Mining is an embarrassingly parallel workload, so an implementation can be adjusted to fully exploit the ALU resources of Xeon Phi,
but hyperthreading should make the "adjustment" work easier. The threads will not be fighting for cache lines, which is the most common reason hyperthreaded processors fail to gain performance.

Anyway, we'll see. I'm not really up to downloading Intel compilers, compiling the code and analyzing the assembly. Maybe in winter?

Please comment, critique, criticize or ridicule BIP 2112: https://bitcointalk.org/index.php?topic=54382.0
Long-term mining prognosis: https://bitcointalk.org/index.php?topic=91101.0
AzN1337c0d3r
Full Member
***
Offline Offline

Activity: 238
Merit: 100



View Profile
June 30, 2012, 12:14:22 AM
 #108

switch to the next on anything that would block execution.
That's the definition of coarse-grained multithreading and I believe you're mistaken. No major processor architecture implements coarse-grained multithreading.

The UltraSPARC T1/T2 switch threads on every cycle, which is the definition of fine-grained multithreading.

The IBM POWER series has implemented true SMT since POWER5.

AMD Bulldozers do the same, but have a semi-unified scheduler that schedules the next instruction (from one of two already decoded streams) onto the next ALU (2 threads -> 4 integer ALUs and 2 FP ALUs).

AMD's implementation is actually SMT if you only regard the int ALU resources. The innovation they made is that they share FP resources across two cores (or they have dedicated integer resources for each core if you want to look at it that way).

Quote
Radeons, however, don't switch on block. They automatically assume there is memory latency and results won't be available for four pipeline executions later. It makes the hardware simpler and easier to design compilers for. I suspect Knights Corner is more like a Radeon than a Niagara in this case.

That's fine-grained multithreading then because it's basically switching to a new thread every cycle.

DiabloD3
Legendary
*
Offline Offline

Activity: 1162
Merit: 1000


DiabloMiner author


View Profile WWW
June 30, 2012, 02:42:14 AM
 #109

switch to the next on anything that would block execution.
That's the definition of coarse-grained multithreading and I believe you're mistaken. No major processor architecture implements coarse-grained multithreading.

The Ultrasparc T1/T2 switches thread on every cycle, which is the definition of fine-grained multithreading.

The IBM Power series have implemented true SMT since Power5.

AMD Bulldozers do the same, but have a semi-unified scheduler that schedules the next instruction (from one of two already decoded streams) onto the next ALU (2 threads -> 4 integer ALUs and 2 FP ALUs).

AMD's implementation is actually SMT if you only regard the int ALU resources. The innovation they made is that they share FP resources across two cores (or they have dedicated integer resources for each core if you want to look at it that way).

Quote
Radeons, however, don't switch on block. They automatically assume there is memory latency and results won't be available for four pipeline executions later. It makes the hardware simpler and easier to design compilers for. I suspect Knights Corner is more like a Radeon than a Niagara in this case.

That's fine-grained multithreading then because it's basically switching to a new thread every cycle.

I wasn't arguing coarse vs fine. I was just arguing on who does what.

Niagaras really do switch on block, but the newer ones might just switch every cycle now that they have a FP unit per core instead of per socket.

Radeons switch every VLIW clause (which can be up to 128 instructions long) due to the unique register layout.

AMD is SMT if you look at it that way, but I look at it much finer grained than SMT: not only do you get SMT, but instructions are scheduled to run on the next free ALU. Fine grained "switch every cycle" would waste resources as ALUs would have no work to run most of the time.

Intel Hyperthreading switches on block on P4s, and I think on i*s that have it they switched to every cycle.

What I'm trying to say is that all modern high-performance archs use the same set of tricks; what differs is the level at which they exploit them. It's rumored that POWER in the future (9? 10?) will just be something like 64 threads piping into one core, where that one core has something like 128 int ALUs and a similar number of FPUs, which is probably the future of non-lockstep highly parallel programming.

AzN1337c0d3r
Full Member
***
Offline Offline

Activity: 238
Merit: 100



View Profile
June 30, 2012, 09:46:32 AM
 #110

Quote from: DiabloD3
AMD is SMT if you look at it that way, but I look at it much finer grained than SMT: not only do you get SMT, but instructions are scheduled to run on the next free ALU. Fine grained "switch every cycle" would waste resources as ALUs would have no work to run most of the time.

Quote from: DiabloD3
Intel Hyperthreading switches on block on P4s, and I think on i*s that have it they switched to every cycle.

I think you have a fundamental misunderstanding of what SMT is as evidenced by the quotes above. SMT is neither fine-grained nor coarse-grained multithreading.

SMT is simultaneous multithreading so there is no switching. Instructions from all threads are eligible to be issued at any given clock cycle.

Quote from: DiabloD3
I wasn't arguing coarse vs fine. I was just arguing on who does what.

Niagaras really do switch on block, but the newer ones might just switch every cycle now that they have a FP unit per core instead of per socket.

From Wikipedia's UltraSPARC T1 article:

Quote
Each core is a barrel processor, meaning it switches between available threads each cycle.

It doesn't switch on a block; it switches EVERY cycle. You can confirm this if you read Sun's architecture manuals (the T1 has been open-sourced).
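
The three terms being argued over can be contrasted with a toy issue-slot model. This is only a sketch under simplifying assumptions: the function names are made up, every thread is assumed to always have an instruction ready, and it does not describe the scheduler of any real chip. Each function returns, per cycle, the list of thread IDs that issue an instruction in that cycle.

```python
def fine_grained(n_threads: int, n_cycles: int):
    """Barrel style: a different thread issues on every cycle, round-robin."""
    return [[c % n_threads] for c in range(n_cycles)]

def coarse_grained(n_threads: int, n_cycles: int, blocking_cycles: set):
    """Switch-on-block: stay on one thread until the instruction it issues blocks."""
    current, schedule = 0, []
    for c in range(n_cycles):
        schedule.append([current])
        if c in blocking_cycles:      # e.g. a cache miss on the instruction just issued
            current = (current + 1) % n_threads
    return schedule

def smt(n_threads: int, n_cycles: int, width: int):
    """Simultaneous: up to `width` instructions per cycle, drawn from several threads."""
    schedule, nxt = [], 0
    for _ in range(n_cycles):
        schedule.append([(nxt + i) % n_threads for i in range(width)])
        nxt = (nxt + width) % n_threads
    return schedule

print(fine_grained(4, 6))         # [[0], [1], [2], [3], [0], [1]]
print(coarse_grained(4, 6, {2}))  # [[0], [0], [0], [1], [1], [1]]
print(smt(4, 3, 2))               # [[0, 1], [2, 3], [0, 1]]
```

In this model only the SMT schedule ever has more than one thread in the same cycle, which is the distinction the Wikipedia quote draws: it depends on how many instructions can issue per cycle and from how many threads.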

DiabloD3
Legendary
*
Offline Offline

Activity: 1162
Merit: 1000


DiabloMiner author


View Profile WWW
June 30, 2012, 10:00:20 AM
 #111

Quote from: DiabloD3
AMD is SMT if you look at it that way, but I look at it much finer grained than SMT: not only do you get SMT, but instructions are scheduled to run on the next free ALU. Fine grained "switch every cycle" would waste resources as ALUs would have no work to run most of the time.

Quote from: DiabloD3
Intel Hyperthreading switches on block on P4s, and I think on i*s that have it they switched to every cycle.

I think you have a fundamental misunderstanding of what SMT is as evidenced by the quotes above. SMT is neither fine-grained nor coarse-grained multithreading.

SMT is simultaneous multithreading so there is no switching. Instructions from all threads are eligible to be issued at any given clock cycle.

Quote from: DiabloD3
I wasn't arguing coarse vs fine. I was just arguing on who does what.

Niagaras really do switch on block, but the newer ones might just switch every cycle now that they have a FP unit per core instead of per socket.

From Wikipedia's UltraSPARCT1 article:

Quote
Each core is a barrel processor, meaning it switches between available threads each cycle.

It doesn't switch on a block, it switches EVERY cycle. You can confirm this if you read Sun's architecture manuals (T1 has been open-sourced)

I know what SMT is, and it only describes one core doing multiple threads simultaneously. It doesn't describe how, coarse or not. T1s, Intel P4s, Intel i*s with HT, Bulldozers, and Radeons can all be described as SMT. They just don't all do it the same way.

As for T1s doing it on block, this is what Sun advertised it as. I'm not surprised their marketing department got it slightly wrong, so I'll let you have that one.

AzN1337c0d3r
Full Member
***
Offline Offline

Activity: 238
Merit: 100



View Profile
June 30, 2012, 10:12:47 AM
 #112


I know what SMT is, and it only describes one core doing multiple threads simultaniously. It doesn't describe how, coarse or not. T1s, Intel P4s, Intel i*s with HT, Bulldozers, and Radeons can all be described as SMT. They just don't all do it the same way.

You seriously still do not understand what SMT is. If you have Computer Architecture: A Quantitative Approach, I suggest you flip to the chapter on multithreading and read it.

Also look at slide 35 from this PowerPoint presentation on threading.

The UltraSPARC T1 cannot possibly be SMT. From the Wikipedia article on SMT:

Quote
The key factor to distinguish them is to look at how many instructions the processor can issue in one cycle and how many threads from which the instructions come. For example, Sun Microsystems' UltraSPARC T1 (known as "Niagara" until its November 14, 2005 release) is a multicore processor combined with fine-grain multithreading technique instead of simultaneous multithreading because each core can only issue one instruction at a time.

DiabloD3
Legendary
*
Offline Offline

Activity: 1162
Merit: 1000


DiabloMiner author


View Profile WWW
June 30, 2012, 10:17:57 AM
 #113


I know what SMT is, and it only describes one core doing multiple threads simultaniously. It doesn't describe how, coarse or not. T1s, Intel P4s, Intel i*s with HT, Bulldozers, and Radeons can all be described as SMT. They just don't all do it the same way.

You seriously still not do understand what SMT is. If you have Computer Architecture: A Quantitative Approach, I suggest you go flip to the chapter on multithreading and reading it.

Also look at slide 35 from this Powerpoint presentation on threading.

Ultrasparc T1 cannot possibly be SMT. From wikipedia article on SMT:

Quote
The key factor to distinguish them is to look at how many instructions the processor can issue in one cycle and how many threads from which the instructions come. For example, Sun Microsystems' UltraSPARC T1 (known as "Niagara" until its November 14, 2005 release) is a multicore processor combined with fine-grain multithreading technique instead of simultaneous multithreading because each core can only issue one instruction at a time.

That's an unusually strict definition of SMT. Issuing instructions from more than one thread at a time to fill load requirements over multiple ALUs is not a requirement to be SMT.

AzN1337c0d3r
Full Member
***
Offline Offline

Activity: 238
Merit: 100



View Profile
June 30, 2012, 10:19:40 AM
 #114

Thats an unusually strict definition of SMT. Issuing instructions from more than one thread at a time to fill load requirements over multiple ALUs is not a requirement to be SMT.

Dude, why do you think it's called simultaneous multithreading then?

If you have found a less strict definition somewhere from an authoritative source, please do share. All the research papers I've read regarding SMT have used that definition.

2112
Legendary
*
Offline Offline

Activity: 2128
Merit: 1073



View Profile
July 10, 2012, 10:00:35 PM
 #115

Just an interesting tidbit I've found on the 2nd pass through the Knights Corner documentation:
Quote
EBX[23:16] = 248; // Maximum number of logical processors
248/4 = 62 not 50.

Why could that be?

1) A leftover from Knights Ferry?
2) Yield with all 62 cores enabled would be zero?
3) Something else?

Please share your guesses.
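
For anyone who wants to redo the arithmetic, here is a small sketch of the bit-field extraction. Only the value 248 in bits 23:16 comes from the quoted manual; the rest of the EBX value and the helper name `bits` are made up for illustration.

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Return bits hi..lo (inclusive) of value, matching the EBX[23:16] notation."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

# Hypothetical register value: only bits 23:16 (= 248) are taken from the manual.
ebx = 248 << 16

logical = bits(ebx, 23, 16)   # maximum number of logical processors
threads_per_core = 4          # "4-Way Threaded", per the same manual
cores = logical // threads_per_core

print(logical, cores)         # 248 62
```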

Please comment, critique, criticize or ridicule BIP 2112: https://bitcointalk.org/index.php?topic=54382.0
Long-term mining prognosis: https://bitcointalk.org/index.php?topic=91101.0
markodude
Newbie
*
Offline Offline

Activity: 45
Merit: 0


View Profile
November 18, 2012, 09:37:30 AM
 #116

I'm getting a shot at one next week. Has anyone got the code to get it mining? Thanks
2112
Legendary
*
Offline Offline

Activity: 2128
Merit: 1073



View Profile
November 18, 2012, 12:50:25 PM
 #117

Im getting a shot of one next week, has anyone got the code to get it mining? Thanks
pooler's cpuminer will mine both Bitcoins and Litecoins. You'll have to recompile to take advantage of the new instructions and massive multithreading. To fully utilize the long vector units you'll probably need to restructure the loops somewhat.

https://bitcointalk.org/index.php?topic=55038.0

Please comment, critique, criticize or ridicule BIP 2112: https://bitcointalk.org/index.php?topic=54382.0
Long-term mining prognosis: https://bitcointalk.org/index.php?topic=91101.0
kiyominer
Newbie
*
Offline Offline

Activity: 4
Merit: 0


View Profile
March 27, 2014, 02:42:53 PM
 #118

i ported pooler's cpuminer to Intel Xeon Phi so it takes advantage of the 512-bit registers and computes 16 hashes at once.
i was able to test it on an Intel Xeon Phi 5100 series card (60 cores, 240 threads @ 1.053 GHz)

so far, i was able to measure 140 MHash/s (using 240 threads).
it was interesting to note that if using 60 threads (using one thread per core) i was able
to achieve 65 MHash/s, which means there could be some room for optimisation, and 260 MHash/s
could be achieved.

on the other hand, i could only achieve 13.8 MHash/s on 240 threads by using the plain C code
(i.e. no use of the 512-bit registers)
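
the numbers above can be cross-checked with a few lines of arithmetic. this is only a sketch: the variable names are made up, the 260 MHash/s figure assumes that four threads per core would scale linearly from the one-thread-per-core result, and the 16 lanes assume 32-bit words as used by SHA-256.

```python
LANES = 512 // 32                  # 32-bit words per 512-bit register
CORES, THREADS_PER_CORE = 60, 4
CLOCK_HZ = 1.053e9

# Measured rates from the post above, in hashes per second.
measured = {"vector_240": 140e6, "vector_60": 65e6, "plain_c_240": 13.8e6}

# Assumption: perfect linear scaling from 1 to 4 threads per core.
ideal_240 = measured["vector_60"] * THREADS_PER_CORE
scaling = measured["vector_240"] / ideal_240
vector_speedup = measured["vector_240"] / measured["plain_c_240"]
core_cycles_per_hash = CLOCK_HZ * CORES / measured["vector_240"]

print("lanes per register:", LANES)
print("ideal rate at 240 threads (MHash/s):", ideal_240 / 1e6)
print("fraction of ideal reached:", round(scaling, 2))
print("vector vs plain C speedup:", round(vector_speedup, 1))
print("core cycles per hash:", round(core_cycles_per_hash))
```

the gap between the measured rate and the linear extrapolation is what the "room for optimisation" remark refers to.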

the code (Linux only) can be downloaded at https://github.com/kiyominer/cpuminer

Cheers,

kiyo
redmonski
Full Member
***
Offline Offline

Activity: 201
Merit: 100


View Profile
March 27, 2014, 03:31:40 PM
 #119

i ported Pooler's cpuminer to Intel Xeon Phi so it takes advantage of the 512 bits registers and compute 16 hashes at once.
i was able to test is on Intel Xeon Phi series 5100 (60 cores, 240 threads @ 1.053 GHz)

so far, i was able to measure 140 MHash/s (using 240 threads).
it was interesting to note that if using 60 threads (using one thread per core) i was able
to achieve 65 MHash/s, which means there could be some room for optimisation, and 260 MHash/s
could be achieved.

on the other hand, i could only achieve 13.8 MHash/s on 240 threads by using the plain C code
(e.g. no use of 512 bit registers use)

the code (Linux only) can be downloaded at https://github.com/kiyominer/cpuminer

Cheers,

kiyo

Hi, does it work with scrypt? thanks.

kiyominer
Newbie
*
Offline Offline

Activity: 4
Merit: 0


View Profile
March 27, 2014, 03:48:41 PM
 #120

Quote
does it work with scrypt?

sorry, only sha256d is supported at this time.
at first glance, an efficient scrypt implementation does not look as easy as sha256d.