Xeon Phi

DiabloD3

Legendary

Offline

Activity: 1162
Merit: 1000

DiabloMiner author

Re: Xeon Phi

June 29, 2012, 04:59:50 AM

#101

Quote from: 2112 on June 29, 2012, 03:11:08 AM

Quote from: mrb on June 29, 2012, 01:38:53 AM

I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.

I think the confusion runs deeper than just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's document number: 327357-001

Quote

2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Yeah, but I suspect that's the Radeon trick: have a pipeline 4 issues deep, so memory latency is effectively hidden. It probably can't switch on demand.

DiabloMiner

AzN1337c0d3r

Full Member

Offline

Activity: 238
Merit: 100


Re: Xeon Phi

June 29, 2012, 05:39:48 AM

#102

Quote from: DiabloD3 on June 29, 2012, 04:59:50 AM

Quote from: 2112 on June 29, 2012, 03:11:08 AM

Quote from: mrb on June 29, 2012, 01:38:53 AM

I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.

I think the confusion runs deeper than just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's document number: 327357-001

Quote

2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Yeah, but I suspect that's the Radeon trick: have a pipeline 4 issues deep, so memory latency is effectively hidden. It probably can't switch on demand.

I remember reading somewhere that MIC was supposed to be barrel-threaded (ie. fine-grained multithreading), somewhat akin to Ultrasparc T1. I can't find the source now though.


DiabloD3

Legendary

Offline

Activity: 1162
Merit: 1000

DiabloMiner author

Re: Xeon Phi

June 29, 2012, 05:57:05 AM

#103

Quote from: AzN1337c0d3r on June 29, 2012, 05:39:48 AM

Quote from: DiabloD3 on June 29, 2012, 04:59:50 AM

Quote from: 2112 on June 29, 2012, 03:11:08 AM

Quote from: mrb on June 29, 2012, 01:38:53 AM

I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.

I think the confusion runs deeper than just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's document number: 327357-001

Quote

2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Yeah, but I suspect that's the Radeon trick: have a pipeline 4 issues deep, so memory latency is effectively hidden. It probably can't switch on demand.

I remember reading somewhere that MIC was supposed to be barrel-threaded (ie. fine-grained multithreading), somewhat akin to Ultrasparc T1. I can't find the source now though.

That's almost the same trick. UltraSPARC T[1-4]s and newer IBM POWERs have multiple thread decoders, and switch to the next thread on anything that would block execution. AMD Bulldozers do the same, but have a semi-unified scheduler that schedules the next instruction (from one of two already decoded streams) onto the next ALU (2 threads -> 4 integer ALUs and 2 FP ALUs).

Radeons, however, don't switch on block. They automatically assume there is memory latency and that results won't be available until four pipeline executions later. It makes the hardware simpler and easier to design compilers for. I suspect Knights Corner is more like a Radeon than a Niagara in this case.

DiabloMiner

mrb

Legendary

Offline

Activity: 1512
Merit: 1034

Re: Xeon Phi

June 29, 2012, 06:25:11 AM

#104

Quote from: 2112 on June 29, 2012, 03:11:08 AM

Quote from: mrb on June 29, 2012, 01:38:53 AM

I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.

I think the confusion runs deeper than just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's document number: 327357-001

Quote

2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Xeon Phi is hyperthreaded (the vendor-neutral term for this is SMT, simultaneous multithreading), but as I am sure you know, SMT does not improve the performance of ALU-bound workloads at all. Therefore we can ignore SMT when estimating the theoretical performance of bitcoin mining.

DiabloD3

Legendary

Offline

Activity: 1162
Merit: 1000

DiabloMiner author

Re: Xeon Phi

June 29, 2012, 06:29:12 AM

#105

Quote from: mrb on June 29, 2012, 06:25:11 AM

Quote from: 2112 on June 29, 2012, 03:11:08 AM

Quote from: mrb on June 29, 2012, 01:38:53 AM

I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.

I think the confusion runs deeper than just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's document number: 327357-001

Quote

2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

It's only hyperthreaded? That's kind of pointless altogether.

DiabloMiner

mrb

Legendary

Offline

Activity: 1512
Merit: 1034

Re: Xeon Phi

June 29, 2012, 06:54:06 AM

#106

Quote from: DiabloD3 on June 29, 2012, 06:29:12 AM

It's only hyperthreaded? That's kind of pointless altogether.

I use this term loosely. I have no idea what type of threading Xeon Phi will implement (the GPU way: switching unconditionally to the next thread on each instruction; or the CPU way: switching to the next thread when the current one would wait on memory).

But either way, it does not matter to us. Mining is an embarrassingly parallel workload, so an implementation can be adjusted to fully exploit the ALU resources of Xeon Phi, and whatever type of threading Xeon Phi implements will not add any further performance.

2112

Legendary

Offline

Activity: 2128
Merit: 1088

Re: Xeon Phi

June 29, 2012, 08:06:03 PM

#107

Quote from: DiabloD3 on June 29, 2012, 06:29:12 AM

It's only hyperthreaded? That's kind of pointless altogether.

It is both superscalar (2-way) and hyperthreaded (4-way).

Another quote from the same manual:

Quote

0x00 0x16 INSTRUCTIONS_EXECUTED Number of instructions executed (up to two per clock)
0x00 0x17 INSTRUCTIONS_EXECUTED_V_PIPE Number of instructions executed in the V_pipe. The event indicates the number of instructions that were paired.
0x20 0x16 VPU_INSTRUCTIONS_EXECUTED Counts the number of VPU instructions executed in both u- and v-pipes.
0x20 0x17 VPU_INSTRUCTIONS_EXECUTED_V_PIPE Counts the number of VPU instructions that paired and executed in the v-pipe.
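The counters quoted above suggest a simple way to see how often the second (v) pipe is actually used. The following is a minimal sketch, assuming that INSTRUCTIONS_EXECUTED counts instructions from both pipes and that INSTRUCTIONS_EXECUTED_V_PIPE counts only the paired ones. The `cycles` argument and the sample numbers are hypothetical and do not come from the manual or from any measurement.

```python
def pairing_stats(instructions_executed, instructions_executed_v_pipe, cycles):
    """Summarise dual-issue behaviour from raw counter values.

    Assumption: INSTRUCTIONS_EXECUTED covers both pipes, while
    INSTRUCTIONS_EXECUTED_V_PIPE covers only the paired (v-pipe) ones.
    `cycles` is a hypothetical elapsed-cycle count from some other counter.
    """
    if instructions_executed <= 0 or cycles <= 0:
        raise ValueError("counters must be positive")
    if instructions_executed_v_pipe > instructions_executed:
        raise ValueError("v-pipe count cannot exceed the total")
    paired_fraction = instructions_executed_v_pipe / instructions_executed
    ipc = instructions_executed / cycles  # at most 2.0 on a 2-way core
    return paired_fraction, ipc


# Hypothetical sample values, not measurements.
paired, ipc = pairing_stats(1_500_000, 500_000, 1_000_000)
print(f"paired in v-pipe: {paired:.1%}, IPC: {ipc:.2f} (max 2.00)")
```

A low paired fraction would indicate that the code rarely dual-issues, which matters more for a mining kernel than the threading model does.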

As mrb said:

Quote from: mrb on June 29, 2012, 06:54:06 AM

Mining is an embarrassingly parallel workload, so an implementation can be adjusted to fully exploit the ALU resources of Xeon Phi,

but hyperthreading should make the "adjustment" work easier. The threads will not be fighting for cache lines, which is the most common reason hyperthreaded processors fail to gain performance.

Anyway, we'll see. I'm not really up to downloading Intel compilers, compiling the code and analyzing the assembly. Maybe in winter?

Please comment, critique, criticize or ridicule BIP 2112: https://bitcointalk.org/index.php?topic=54382.0
Long-term mining prognosis: https://bitcointalk.org/index.php?topic=91101.0

AzN1337c0d3r

Full Member

Offline

Activity: 238
Merit: 100


Re: Xeon Phi

June 30, 2012, 12:14:22 AM

#108

Quote from: DiabloD3 on June 29, 2012, 05:57:05 AM

switch to the next on anything that would block execution.

That's the definition of coarse-grained multithreading and I believe you're mistaken. No major processor architecture implements coarse-grained multithreading.

The Ultrasparc T1/T2 switches thread on every cycle, which is the definition of fine-grained multithreading.

The IBM Power series have implemented true SMT since Power5.

Quote from: DiabloD3 on June 29, 2012, 05:57:05 AM

AMD Bulldozers do the same, but have a semi-unified scheduler that schedules the next instruction (from one of two already decoded streams) onto the next ALU (2 threads -> 4 integer ALUs and 2 FP ALUs).

AMD's implementation is actually SMT if you only regard the int ALU resources. The innovation they made is that they share FP resources across two cores (or they have dedicated integer resources for each core if you want to look at it that way).

Quote

Radeons, however, don't switch on block. They automatically assume there is memory latency and that results won't be available until four pipeline executions later. It makes the hardware simpler and easier to design compilers for. I suspect Knights Corner is more like a Radeon than a Niagara in this case.

That's fine-grained multithreading then because it's basically switching to a new thread every cycle.

DiabloD3

Legendary

Offline

Activity: 1162
Merit: 1000

DiabloMiner author

Re: Xeon Phi

June 30, 2012, 02:42:14 AM

#109

Quote from: AzN1337c0d3r on June 30, 2012, 12:14:22 AM

Quote from: DiabloD3 on June 29, 2012, 05:57:05 AM

switch to the next on anything that would block execution.

Quote from: DiabloD3 on June 29, 2012, 05:57:05 AM

Quote

That's fine-grained multithreading then because it's basically switching to a new thread every cycle.

I wasn't arguing coarse vs fine. I was just arguing on who does what.

Niagaras really do switch on block, but the newer ones might just switch every cycle now that they have a FP unit per core instead of per socket.

Radeons switch every VLIW clause (which can be up to 128 instructions long) due to the unique register layout.

AMD is SMT if you look at it that way, but I look at it much finer grained than SMT: not only do you get SMT, but instructions are scheduled to run on the next free ALU. Fine grained "switch every cycle" would waste resources as ALUs would have no work to run most of the time.

Intel Hyperthreading switches on block on P4s, and I think on i*s that have it they switched to every cycle.

I mean, what I'm trying to say is, all modern high-performance archs use the same set of tricks; the difference is the level at which they exploit them. It's rumored that a future POWER (9? 10?) will just be something like 64 threads piping into one core, where that one core has something like 128 int ALUs and a similar number of FPUs, which is probably the future of non-lockstep highly parallel programming.

DiabloMiner

AzN1337c0d3r

Full Member

Offline

Activity: 238
Merit: 100


Re: Xeon Phi

June 30, 2012, 09:46:32 AM

#110

Quote from: DiabloD3

AMD is SMT if you look at it that way, but I look at it much finer grained than SMT: not only do you get SMT, but instructions are scheduled to run on the next free ALU. Fine grained "switch every cycle" would waste resources as ALUs would have no work to run most of the time.

Quote from: DiabloD3

Intel Hyperthreading switches on block on P4s, and I think on i*s that have it they switched to every cycle.

I think you have a fundamental misunderstanding of what SMT is as evidenced by the quotes above. SMT is neither fine-grained nor coarse-grained multithreading.

SMT is simultaneous multithreading so there is no switching. Instructions from all threads are eligible to be issued at any given clock cycle.
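The three terms argued over in this thread can be contrasted with a toy issue model. This is only a sketch under simplifying assumptions: every third instruction of a thread is a load that blocks it for two cycles, the core has two issue slots, and the SMT policy takes at most one instruction per thread per cycle. None of the parameters describe a real processor, and the function name `simulate` is made up for illustration.

```python
def simulate(policy, n_threads=4, width=2, cycles=8, stall_every=3, stall_len=2):
    """Return, per cycle, the list of thread ids that issued an instruction.

    Toy model: every `stall_every`-th instruction of a thread is a load
    that blocks that thread for `stall_len` cycles.
    """
    issued = [0] * n_threads     # instructions issued so far, per thread
    ready_at = [0] * n_threads   # first cycle at which each thread may issue
    current = 0                  # thread that currently owns the pipeline
    trace = []

    def issue(t, cycle):
        issued[t] += 1
        if issued[t] % stall_every == 0:
            ready_at[t] = cycle + 1 + stall_len

    for cycle in range(cycles):
        ready = [t for t in range(n_threads) if ready_at[t] <= cycle]
        slot = []
        if policy == "coarse":
            # Stay on one thread; switch only when it is blocked.
            if current not in ready and ready:
                current = ready[0]
            if current in ready:
                slot = [current]
        elif policy == "fine":
            # Barrel style: move on to the next ready thread every cycle.
            order = [(current + k) % n_threads for k in range(n_threads)]
            pick = next((t for t in order if t in ready), None)
            if pick is not None:
                slot = [pick]
                current = (pick + 1) % n_threads
        elif policy == "smt":
            # No switching: any ready thread may fill any issue slot.
            slot = ready[:width]
        else:
            raise ValueError(policy)
        for t in slot:
            issue(t, cycle)
        trace.append(slot)
    return trace


for name in ("coarse", "fine", "smt"):
    print(f"{name:6s}", simulate(name))
```

In this model only the "smt" policy ever issues from two different threads in the same cycle, which is the distinction being drawn here.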

Quote from: DiabloD3

From Wikipedia's UltraSPARC T1 article:

Quote

Each core is a barrel processor, meaning it switches between available threads each cycle.

It doesn't switch on a block, it switches EVERY cycle. You can confirm this if you read Sun's architecture manuals (T1 has been open-sourced)

DiabloD3

Legendary

Offline

Activity: 1162
Merit: 1000

DiabloMiner author

Re: Xeon Phi

June 30, 2012, 10:00:20 AM

#111

Quote from: AzN1337c0d3r on June 30, 2012, 09:46:32 AM

Quote from: DiabloD3

Intel Hyperthreading switches on block on P4s, and I think on i*s that have it they switched to every cycle.

Quote from: DiabloD3

From Wikipedia's UltraSPARC T1 article:

Quote

Each core is a barrel processor, meaning it switches between available threads each cycle.

It doesn't switch on a block, it switches EVERY cycle. You can confirm this if you read Sun's architecture manuals (T1 has been open-sourced)

I know what SMT is, and it only describes one core doing multiple threads simultaneously. It doesn't describe how, coarse or not. T1s, Intel P4s, Intel i*s with HT, Bulldozers, and Radeons can all be described as SMT. They just don't all do it the same way.

As for T1s doing it on block, this is what Sun advertised it as. I'm not surprised their marketing department got it slightly wrong, so I'll let you have that one.

DiabloMiner

AzN1337c0d3r

Full Member

Offline

Activity: 238
Merit: 100


Re: Xeon Phi

June 30, 2012, 10:12:47 AM

#112

Quote from: DiabloD3 on June 30, 2012, 10:00:20 AM

You seriously still do not understand what SMT is. If you have Computer Architecture: A Quantitative Approach, I suggest you flip to the chapter on multithreading and read it.

Also look at slide 35 from this Powerpoint presentation on threading.

Ultrasparc T1 cannot possibly be SMT. From wikipedia article on SMT:

Quote

The key factor to distinguish them is to look at how many instructions the processor can issue in one cycle and how many threads from which the instructions come. For example, Sun Microsystems' UltraSPARC T1 (known as "Niagara" until its November 14, 2005 release) is a multicore processor combined with fine-grain multithreading technique instead of simultaneous multithreading because each core can only issue one instruction at a time.

DiabloD3

Legendary

Offline

Activity: 1162
Merit: 1000

DiabloMiner author

Re: Xeon Phi

June 30, 2012, 10:17:57 AM

#113

Quote from: AzN1337c0d3r on June 30, 2012, 10:12:47 AM

Quote from: DiabloD3 on June 30, 2012, 10:00:20 AM

Quote

That's an unusually strict definition of SMT. Issuing instructions from more than one thread at a time to fill load requirements over multiple ALUs is not a requirement to be SMT.

DiabloMiner

AzN1337c0d3r

Full Member

Offline

Activity: 238
Merit: 100


Re: Xeon Phi

June 30, 2012, 10:19:40 AM

#114

Quote from: DiabloD3 on June 30, 2012, 10:17:57 AM

That's an unusually strict definition of SMT. Issuing instructions from more than one thread at a time to fill load requirements over multiple ALUs is not a requirement to be SMT.

Dude, why do you think it's called simultaneous multithreading then?

If you have found a less strict definition somewhere from an authoritative source, please do share. All the research papers I've read regarding SMT have used that definition.

2112

Legendary

Offline

Activity: 2128
Merit: 1088

Re: Xeon Phi

July 10, 2012, 10:00:35 PM

#115

Just an interesting tidbit I've found on the 2nd pass through the Knights Corner documentation:

Quote

EBX[23:16] = 248; // Maximum number of logical processors

248/4 = 62 not 50.

Why could that be?

1) A leftover from Knights Ferry?
2) Yield with all 62 cores enabled would be zero?
3) Something else?

Please share your guesses.
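For reference, the arithmetic above can be reproduced with a small bit-field helper. This is a sketch only: the `ebx` value is hypothetical and carries nothing but the quoted field EBX[23:16] = 248, and the 4 threads per core come from the "4-Way Threaded" line of the manual quoted earlier in the thread.

```python
def bits(value, hi, lo):
    """Return the bit field value[hi:lo], both bounds inclusive."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


# Hypothetical EBX value: only bits 23:16 (= 248) come from the quoted text;
# all other bits are zero here purely for illustration.
ebx = 248 << 16

logical = bits(ebx, 23, 16)   # maximum number of logical processors
threads_per_core = 4          # "4-Way Threaded" per the manual
cores = logical // threads_per_core
print(logical, cores)         # 248 62
```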


markodude

Newbie

Offline

Activity: 45
Merit: 0

Re: Xeon Phi

November 18, 2012, 09:37:30 AM

#116

I'm getting a shot at one next week. Has anyone got the code to get it mining? Thanks

2112

Legendary

Offline

Activity: 2128
Merit: 1088

Re: Xeon Phi

November 18, 2012, 12:50:25 PM

#117

Quote from: markodude on November 18, 2012, 09:37:30 AM

I'm getting a shot at one next week. Has anyone got the code to get it mining? Thanks

pooler's cpuminer will mine both Bitcoins and Litecoins. You'll have to recompile to take advantage of the new instructions and massive multithreading. To fully utilize the long vector units you'll probably need to restructure the loops somewhat.
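As a rough illustration of what restructuring the loops could mean, the sketch below applies one SHA-256 building block (the Σ0 function) to several independent candidates in lock step, so that the candidate index becomes the innermost, vectorizable dimension. This is plain Python standing in for vector instructions, not code from cpuminer. `LANES = 16` assumes 512-bit registers holding 32-bit words, and the input words are arbitrary.

```python
LANES = 16          # assumption: 512-bit register / 32-bit words
MASK = 0xFFFFFFFF


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def big_sigma0(a):
    """SHA-256 Sigma0 for a single 32-bit word (scalar form)."""
    return rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)


def big_sigma0_lanes(a_lanes):
    """Same function applied to all lanes in lock step.

    Each list stands in for one vector register and each comprehension
    stands in for one vector instruction.
    """
    r2 = [rotr(a, 2) for a in a_lanes]
    r13 = [rotr(a, 13) for a in a_lanes]
    r22 = [rotr(a, 22) for a in a_lanes]
    return [x ^ y ^ z for x, y, z in zip(r2, r13, r22)]


# One independent candidate per lane; the base value is arbitrary.
words = [(0x6A09E667 + lane) & MASK for lane in range(LANES)]
assert big_sigma0_lanes(words) == [big_sigma0(w) for w in words]
```

The scalar version loops over candidates and runs the whole function for each one. The lane version runs each step once across all candidates, which is the shape a compiler or hand-written intrinsics can map onto wide registers.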

https://bitcointalk.org/index.php?topic=55038.0


kiyominer

Newbie

Offline

Activity: 4
Merit: 0

Re: Xeon Phi

March 27, 2014, 02:42:53 PM

#118

I ported pooler's cpuminer to the Intel Xeon Phi so that it takes advantage of the 512-bit registers and computes 16 hashes at once.
I was able to test it on an Intel Xeon Phi 5100 series card (60 cores, 240 threads @ 1.053 GHz).

So far I have measured 140 MHash/s using 240 threads.
Interestingly, with 60 threads (one thread per core) I was able to achieve 65 MHash/s, which suggests there is room for optimisation: if throughput scaled linearly with thread count, 260 MHash/s could be achieved.

On the other hand, I could only achieve 13.8 MHash/s on 240 threads using the plain C code (i.e. without using the 512-bit registers).

The code (Linux only) can be downloaded at https://github.com/kiyominer/cpuminer
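The figures in this post can be cross-checked with a few lines of arithmetic. This is a sketch that takes the three reported hash rates at face value and assumes perfectly linear scaling from 60 to 240 threads as the ideal. The 16x figure is the nominal lane count of a 512-bit register, not a measured limit.

```python
REGISTER_BITS = 512
WORD_BITS = 32                     # SHA-256 operates on 32-bit words
lanes = REGISTER_BITS // WORD_BITS

mh_240_vector = 140.0              # reported: 240 threads, 512-bit code
mh_60_vector = 65.0                # reported: 60 threads, one per core
mh_240_scalar = 13.8               # reported: 240 threads, plain C code

# Assumption: ideal throughput scales linearly with thread count.
ideal_240 = mh_60_vector * (240 / 60)
thread_efficiency = mh_240_vector / ideal_240
vector_speedup = mh_240_vector / mh_240_scalar

print(f"lanes per register : {lanes}")
print(f"ideal at 240 thr   : {ideal_240:.0f} MHash/s")
print(f"thread efficiency  : {thread_efficiency:.0%}")
print(f"vector speedup     : {vector_speedup:.1f}x (nominal lane count {lanes}x)")
```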

Cheers,

kiyo

redmonski

Full Member

Offline

Activity: 201
Merit: 100

Re: Xeon Phi

March 27, 2014, 03:31:40 PM

#119

Quote from: kiyominer on March 27, 2014, 02:42:53 PM

Hi, does it work with scrypt? thanks.


kiyominer

Newbie

Offline

Activity: 4
Merit: 0

Re: Xeon Phi

March 27, 2014, 03:48:41 PM

#120

Quote

does it work with scrypt?

Sorry, only sha256d is supported at this time.
At first glance, an efficient scrypt implementation does not look as easy as sha256d.
