Bitcoin Forum
  Show Posts
1021  Alternate cryptocurrencies / Altcoin Discussion / Re: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin? on: April 16, 2016, 02:16:09 AM
If you look at the basic C operations, they pretty much all correspond to a single CPU instruction (or a very small number of them) from the 70s. As TPTB said, it was intended to be a thin, somewhat higher-level abstraction that stays close to the hardware. Original C wasn't even that portable, in the sense that you didn't have things like fixed-size integer types. To get something comparable today you have to consider intrinsics for new instructions (which didn't exist in the 70s) as part of the language.

The original design of C never included all these highly aggressive optimizations that compilers attempt today; that was all added later. Back in the day, optimizers were largely confined to the realm of FORTRAN. They succeed in some cases for C, of course, but it's a bit of a square peg in a round hole.

The optimizations added later were made possible by more complex code that could be simplified (=unnecessary complexity getting scaled down), plus the increase in usable RAM.

For example, if an unrolled loop was faster than a rolled one, or inlining worked better but cost you more memory (which you now had), it was worth it. But only up to a point (again), because now we have the performance-wall limits of L1 and L2 cache sizes, which are roughly the main-RAM sizes of the 1970s.

But in terms of instruction set utilization it's a clusterfuck. In a way we don't even need superoptimizers when we have PGO for code that has been sampled running in a predetermined way. You allow the compiler to see PRECISELY what the program does. No "ifs" or "hows". It KNOWS what the program does. It KNOWS the logic and flow. It saw it running.

You (as a compiler) are allowed to see the executable doing

b=b/g
bb=bb/g
bbb=bbb/g
bbbb=bbbb/g

...and you now know that you can pack these 4 into two SIMD instructions. You didn't even have to see it running: you already knew these were different variables with different outcomes, and you knew they were aligned to the correct size. But even if you had any doubts, you saw them running anyway with -fprofile-generate. And still you are not packing these fuckers together after -fprofile-use. And that's the point I'm furious about.

It's just a few simple "if then else" branches in the heuristics: IF you see instructions that can be packed, THEN FUCKING PACK THEM instead of issuing serial scalar/SISD instructions. With AVX the loss is not 2-4x but 4-8x. It's insane.
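For reference, the packing I'm describing takes only a couple of SSE2 intrinsics to express by hand. A minimal sketch (the function name and layout are mine, not from my benchmark; assumes x86-64 with SSE2):

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Divide four doubles by g using two packed divisions (divpd)
   instead of four scalar ones (divsd). */
static void div4_packed(double *b, double *bb, double *bbb, double *bbbb, double g)
{
    __m128d vg = _mm_set1_pd(g);            /* {g, g} in both lanes */
    __m128d lo = _mm_set_pd(*bb, *b);       /* pack b (low) and bb (high) */
    __m128d hi = _mm_set_pd(*bbbb, *bbb);   /* pack bbb (low) and bbbb (high) */
    lo = _mm_div_pd(lo, vg);                /* one divpd = two divisions */
    hi = _mm_div_pd(hi, vg);
    *b    = _mm_cvtsd_f64(lo);                      /* unpack low lane */
    *bb   = _mm_cvtsd_f64(_mm_unpackhi_pd(lo, lo)); /* unpack high lane */
    *bbb  = _mm_cvtsd_f64(hi);
    *bbbb = _mm_cvtsd_f64(_mm_unpackhi_pd(hi, hi));
}
```

Packed division rounds each lane exactly like the scalar instruction would, so this transformation is bit-identical to the four serial divsd ops - which is exactly why there is no safety excuse for not doing it.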

You don't need to know much about compilers to understand that their optimizations suck. You just see what an epic fail their PGO is and you know how bad their heuristics are, when they can't even tell what can be optimized while knowing full well the flow/logic/speed/bottlenecks, etc. of the program.

I'm kind of repeating myself over and over for emphasis, but we need to realize that at the point where the profiler knows what the program does, there are no excuses left of the type "but, but, but I don't know if that optimization is safe so I can't risk it". No, now you know. With 100% certainty. (Not that packing 2 into 1 was risky.)
1022  Alternate cryptocurrencies / Altcoin Discussion / Re: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin? on: April 16, 2016, 12:11:01 AM
Yes, the streaming argument is valid, but the processor is capable of more than that.

Compilers are not superoptimizers. They can't and don't promise to do everything a processor is capable of.

Basically that brings us back to the starting point... When C was first created, it promised to be very fast and suitable for writing OSes, etc. Meaning, its compiler wasn't leaving much performance on the table. With kHz of speed and a few kbytes of memory there was no room for inefficiency.

Granted, the instruction set has expanded greatly since the 70s with FPUs (x87), MMX, SSE(x), AVX(x), AES, etc., but that was the promise: to keep the result close to zero overhead (compared to asm). That's what C promised to be.

But that has gone out the window as the compilers failed to match the progress and expansion of the CPU's arsenal of tools. We are 15 years past SSE2 and we are still discussing why the hell it isn't using SSE2 in a packed manner. This isn't normal by my standards.

Quote
Maybe, maybe not. It just apparently hasn't been a priority in the development of GCC. Have you tried icc to see if it does better for example? (I don't know the answer.)

Yes, it's somewhat better but not what I expected. That was in version 12, IIRC; now it's at 15 or 16, again IIRC. I've actually used clang, icc, and AMD Open64 - they don't have any serious differences. In some apps or cracking tools they might; I've seen icc excel in some crypto stuff.

Quote
It is quite possible that an effort to improve optimization in GCC for the purposes of, for example, cryptographic algorithms would bear fruit. Whether that would be accepted into the compiler given its overall tradeoffs I don't know.

We need a better GCC in general. But that's easy to ask for when someone else has to code it.
1023  Alternate cryptocurrencies / Altcoin Discussion / Re: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin? on: April 15, 2016, 11:53:55 PM
Yes, the streaming argument is valid, but the processor is capable of more than that.

I guess I'm asking too much when I want the compiler to group similar but separate/independent (=safe) operations into one SIMD instruction.

The comparison is not against some compiler that exists in fantasy land, but with real-life asm improvements.

How many cryptographic functions, mining programs, etc. aren't hand-tweaked for much greater performance? Why? Because the compilers don't do what they have to do.

For example, what's the greatest speed of a SHA-256 in C, and what's the equivalent in C+asm? I'd guess at least twice as fast for the latter.
1024  Alternate cryptocurrencies / Altcoin Discussion / Re: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin? on: April 15, 2016, 11:39:41 PM
Yeah, the golden brackets of SIMDs... compilers love those, don't they? But SIMD instructions are rarely emitted if one isn't using arrays.

If my loop was

for 1 to 500mn loops do

b=b/g
bb=bb/g
bbb=bbb/g
bbbb=bbbb/g

...it wouldn't use any packed instructions.

Btw, if I made it 100mn loops x 4 math operations, as the original spec intended (I did 4 ops x 5 times the maths in every loop to compensate for fast finishing speeds - though apparently I won't be using them now with values like 2.432094328043280942, as it goes up to 20+ secs instead of 2), then I'd have to manually unroll the loop, lower the loop count and create arrays. Why? Because without those golden brackets the compiler is useless. You have to write, not as you want to write, but as the compiler wants you to write.
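The array arrangement the compiler wants looks roughly like this (a sketch, names mine): with the four variables packed into one contiguous array, gcc -O3 will typically auto-vectorize the inner loop into packed divpd, no unsafe-math flags needed, since packed division rounds each lane exactly like scalar division.

```c
#include <assert.h>

/* Same halving workload, but written over a contiguous array instead of
   four named variables -- the "golden brackets" form the vectorizer likes. */
static void halve_rounds(double v[4], double g, int rounds)
{
    for (int r = 0; r < rounds; r++)
        for (int j = 0; j < 4; j++)   /* this inner loop is the vectorizable part */
            v[j] = v[j] / g;
}
```

Which is exactly the complaint: the work is identical either way, but only this layout reliably gets the SIMD treatment.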
1025  Alternate cryptocurrencies / Altcoin Discussion / Re: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin? on: April 15, 2016, 10:26:55 PM
I shouldn't have to manually convert divs to multiplications to get the job done faster. It's kind of elemental.

No it isn't elemental and it isn't even a valid optimization (without sacrificing accuracy with -funsafe-math, etc.).

I will insist on that. It is elemental.

GCC also exhibits the same behavior (converting divs => muls) at very low optimization levels, because the results are identical. Even at -O1 or -O2. This is *not* reserved for the higher-level, unsafe optimizations.
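The reason this rewrite is legal even at -O1 is that 0.5 is a power of two and therefore exactly representable: x/2.0 and x*0.5 round to the exact same bits for every finite x. A quick check (helper name mine):

```c
#include <assert.h>

/* x/2.0 vs x*0.5: because 0.5 is exactly representable, both expressions
   produce bit-identical results -- so the div->mul rewrite for g=2 is
   safe at any optimization level. */
static int halving_is_exact(double x)
{
    volatile double half = 0.5;   /* volatile: keep the compiler from folding it away */
    return x / 2.0 == x * half;
}
```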

   b=b/g;
   b=b/g;
   b=b/g;
   b=b/g;
   b=b/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;

=> -O0 (no optimization) output like (divsd = scalar SSE divide):

  400744:   f2 0f 5e 45 b0          divsd  xmm0,QWORD PTR [rbp-0x50]
  400749:   f2 0f 11 45 f0          movsd  QWORD PTR [rbp-0x10],xmm0
  40074e:   f2 0f 10 45 f0          movsd  xmm0,QWORD PTR [rbp-0x10]
  400753:   f2 0f 5e 45 b0          divsd  xmm0,QWORD PTR [rbp-0x50]
  400758:   f2 0f 11 45 f0          movsd  QWORD PTR [rbp-0x10],xmm0
  40075d:   f2 0f 10 45 f0          movsd  xmm0,QWORD PTR [rbp-0x10]
  400762:   f2 0f 5e 45 b0          divsd  xmm0,QWORD PTR [rbp-0x50]
  400767:   f2 0f 11 45 f0          movsd  QWORD PTR [rbp-0x10],xmm0
  40076c:   f2 0f 10 45 f0          movsd  xmm0,QWORD PTR [rbp-0x10]

=> -O1 (low optimization) output like:

  400728:   f2 0f 10 44 24 20       movsd  xmm0,QWORD PTR [rsp+0x20]
  40072e:   f2 0f 59 c1             mulsd  xmm0,xmm1
  400732:   f2 0f 59 c1             mulsd  xmm0,xmm1
  400736:   f2 0f 59 c1             mulsd  xmm0,xmm1
  40073a:   f2 0f 59 c1             mulsd  xmm0,xmm1
  40073e:   f2 0f 59 c1             mulsd  xmm0,xmm1
  400742:   66 44 0f 28 d0          movapd xmm10,xmm0
  400747:   f2 0f 11 44 24 20       movsd  QWORD PTR [rsp+0x20],xmm0
  40074d:   f2 0f 10 44 24 08       movsd  xmm0,QWORD PTR [rsp+0x8]
  400753:   f2 0f 59 c1             mulsd  xmm0,xmm1
  400757:   f2 0f 59 c1             mulsd  xmm0,xmm1
  40075b:   f2 0f 59 c1             mulsd  xmm0,xmm1

=> -O2 and -O3: more of the same, but 20 scalar multiplications one after the other (and probably intentionally avoiding xmm0, which in my experience is slower):

  40060f:   90                      nop
  400610:   f2 0f 59 e9             mulsd  xmm5,xmm1
  400614:   f2 0f 59 e1             mulsd  xmm4,xmm1
  400618:   f2 0f 59 d9             mulsd  xmm3,xmm1
  40061c:   f2 0f 59 d1             mulsd  xmm2,xmm1
  400620:   f2 0f 59 e9             mulsd  xmm5,xmm1
  400624:   f2 0f 59 e1             mulsd  xmm4,xmm1
  400628:   f2 0f 59 d9             mulsd  xmm3,xmm1
  40062c:   f2 0f 59 d1             mulsd  xmm2,xmm1
  400630:   f2 0f 59 e9             mulsd  xmm5,xmm1
  400634:   f2 0f 59 e1             mulsd  xmm4,xmm1
  400638:   f2 0f 59 d9             mulsd  xmm3,xmm1
  40063c:   f2 0f 59 d1             mulsd  xmm2,xmm1
  400640:   f2 0f 59 e9             mulsd  xmm5,xmm1
  400644:   f2 0f 59 e1             mulsd  xmm4,xmm1
  400648:   f2 0f 59 d9             mulsd  xmm3,xmm1
  40064c:   f2 0f 59 d1             mulsd  xmm2,xmm1
  400650:   f2 0f 59 e9             mulsd  xmm5,xmm1
  400654:   f2 0f 59 e1             mulsd  xmm4,xmm1
  400658:   f2 0f 59 d9             mulsd  xmm3,xmm1
  40065c:   f2 0f 59 d1             mulsd  xmm2,xmm1
  400660:   66 0f 2e f5             ucomisd xmm6,xmm5

And finally at -Ofast you get ~5 scalar multiplications and the rest are multiplications broken down into ...additions (scalar and packed), because adds are faster than muls. -Ofast can degrade accuracy; I don't know if that's related to the muls => adds conversion, I'll have to analyze it at some point. But divs => muls is definitely safe here. They may not do it via the proposed 1/n (or it could be conditional on n not being zero), but they do it. We are running such code daily if our binaries are compiled with anything above -O0 (and they are).

edit:

I tried a variation where instead of /g with g=2 (which becomes *0.5), I used a g value like 2.432985742898957284979048059480928509285309285290853029850235942...

Now, at that level, it uses divs up to -O3, and only at -Ofast does it turn them into muls and packed adds.

So the difference between the GCC and FreePascal compilers is that GCC can easily tell where it can turn divs into muls safely (i.e. where precision is not affected), while FreePascal doesn't do it at all, even with safe values like /2 (= *0.5) that don't affect accuracy.
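This caution is easy to demonstrate: for g=2 the reciprocal is exact, but for a general g the value 1/g usually is not representable, so b/g and b*(1/g) can differ in the last bit. A small helper (name mine) makes both cases visible:

```c
#include <assert.h>

/* Does multiplying by the precomputed reciprocal give the exact same
   result as dividing?  For powers of two: always.  For a general g:
   not necessarily -- which is why GCC gates the rewrite behind
   unsafe-math flags when 1/g is inexact. */
static int reciprocal_matches(double b, double g)
{
    volatile double inv = 1.0 / g;   /* volatile: force a real runtime reciprocal */
    return b / g == b * inv;
}
```

The classic counterexample is g=3: the double 1/3 is slightly below the true value, and 5.0*(1.0/3.0) lands one ulp below the correctly rounded 5.0/3.0.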
1026  Alternate cryptocurrencies / Altcoin Discussion / Re: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin? on: April 15, 2016, 06:09:28 PM
Did the low-level optimizations violate any of the invariants of the C or C++ code? See the problem is that these languages have corner cases where there are insufficient invariants and thus you don't get what you thought the invariants were.

-Ofast probably...

But -fprofile-generate on top of -Ofast shouldn't give me 500ms instead of 1200 (~2.5x speedup). That's totally absurd. It's just running a profiler (with OVERHEAD) to monitor how the flow goes, writing the profile to disk and then using it for the next compilation run. It's not supposed to go faster. A build never runs faster under a profiler.

That is entirely the wrong conceptualization. The semantics of the C++ code is captured by the type system.

Which, again, is a spec in a book. It's theory.

Compilation and asm output is all about the compiler, no?
1027  Alternate cryptocurrencies / Altcoin Discussion / Re: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin? on: April 15, 2016, 05:56:02 PM
Elegance and comprehensibility via holistic unification of design concepts. You basically have to know the C compiler source code now to know what it will do. The 1000+ pages of specification is a clusterfuck.

Actually in order to understand what the compiler will try to do, you must first have a good grasp of another few thousand pages: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

...and even then, practice will destroy theoretical behavior. I'm seeing this over and over and over. The counterintuitive behavior that results from asm (weird hardware behavior) or compiler parameters (weird translation to asm) is mind-boggling.

I ran into a very weird bug this morning, and I'm still scratching my head. GCC has an option you can compile with, -fprofile-generate.

It lets you run the program in real-time profiling mode: the program saves relevant information about how it runs to a disk file, and then you recompile with -fprofile-use. With -fprofile-use, GCC reads the disk file, sees how execution of the profiled test binary actually went, and regenerates the binary (now understanding the logic and knowing what it can do better) to perform better.

So I have this small benchmark that does 100mn loops of 20 divisions by 2. Periodically it bumps the values back up so that it still has something to divide by 2. I time this and look at the results.

Code:
#include <math.h>     
#include <stdio.h>    
#include <time.h>
 
int main()
{
printf("\n");

const double a = 3333333.3456743289;  //initial randomly assigned values to start halving
const double aa = 4444555.444334244;
const double aaa = 6666777.66666666;
const double aaaa = 32769999.123458;

unsigned int i;
double score;
double g; //the number to be used for making the divisions, so essentially halving everything each round

double b;
double bb;
double bbb;
double bbbb;

g = 2;  

b = a;
bb = aa;
bbb = aaa;
bbbb = aaaa;

double total_time;
clock_t start, end;
 
start = 0;
end = 0;
score = 0;

start = clock();
 
 for (i = 1; i <100000001; i++)
 {
   b=b/g;
   b=b/g;
   b=b/g;
   b=b/g;
   b=b/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
    
   if (b    < 1.0000001)  {b=b+i+12.432432432;}  //just adding more stuff  in order for the number
   if (bb   < 1.0000001)  {bb=bb+i+15.4324442;}  //to return back to larger values after several
   if (bbb  < 1.0000001)  {bbb=bbb+i+19.42884;}  //rounds of halving
   if (bbbb < 1.0000001)  {bbbb=bbbb+i+34.481;}
}

 end = clock();

 total_time = ((double) (end - start)) / CLOCKS_PER_SEC * 1000;
  
 score = (10000000 / total_time);
 printf("\nFinal number: %0.20f", (b+bb+bbb+bbbb));
 
 printf("\nTime elapsed: %0.0f msecs", total_time);  
 printf("\nScore: %0.0f\n", score);
 
 return 0;
}

(executed on a quad Q8200 @ 1.75GHz, underclocked)

gcc Maths4asm.c -lm -O0  => 6224ms
gcc Maths4asm.c -lm -O2 and -O3  => 1527ms
gcc Maths4asm.c -lm -Ofast  => 1227ms
gcc Maths4asm.c -lm -Ofast -march=nocona => 1236ms
gcc Maths4asm.c -lm -Ofast -march=core2 => 1197ms  (I have a core quad, technically it's core2 arch)
gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-generate => 624ms.
gcc Maths4asm.c -lm -Ofast -march=nocona -fprofile-generate => 530ms.
gcc Maths4asm.c -lm -Ofast -march=nocona -fprofile-use => 1258ms (slower than without PGO, slower than -fprofile-generate)
gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-use => 1222ms (slower than without PGO, slower than -fprofile-generate).

So PGO made it worse (L O L), but the most mind-blowing thing is that the profiler run itself got execution down to 530ms. The -fprofile-generate run should normally take this to 4000-5000ms or more, since it instruments the process to create a log file. I have never seen a -fprofile-generate build that wasn't at least 2-3 times slower than a normal build - let alone 2-3 times faster. This is totally absurd.

And then, to top it all, -fprofile-use (using the logfile to create the best binary) created worse binaries.

Oh, and "nocona" (Pentium 4+) suddenly became ...the better architecture instead of core2.

This stuff is almost unbelievable. I initially thought the profiler must be enabling multithreading, but no: I scripted 4 simultaneous runs and they all gave the same time - meaning there was no extra CPU use in other threads.

Quote
Add more elegant syntax and less noisy implementation of polymorphism, first-class functions, etc.. Then fix corner cases (e.g. 'const') where the C++ compiler can't give you the correct warning nor (as explained in the video) enforce the programmer's intended semantics at the LLVM optimization layer.

There is no "can't give you". It's simply not programmed to. It can be programmed to do whatever you want: from compiling only safe code, to giving you the correct warning, and so on. You quoted something similar in the post above on how C++ is being "upgraded" to do exactly that.

The language is just a syntax written in a book for programmers ("this is how you'll code the X language"), plus the text file the coder writes. But it's all happening in the compiler, really. If you have a great compiler => "wow, the X language is fast". If you have a very verbose compiler that actually helps you code better => "wow, the X language is excellent at warnings and helping you code"... If the compiler can switch 'safe' and 'unsafe' execution styles on and off => "wow, the X language is very flexible", etc. etc.

A language, ultimately, is as good as its compiler - in terms of features. Syntax and structure are different issues, and I generally prefer simple-to-read (and -write) code over high levels of abstraction. It's not that I'm bad at abstract thinking; it's simply more time-consuming for me to search through multiple files to see what each thing does, then follow references to other parts of the code, etc. How is that supposed to be readable?

Quote
AlexGR, I think you would be well served by taking a compiler design course and the implementation of both a low-level imperative paradigm and a high-level functional programming paradigm languages. This would start to help you see all the variables involved that you are trying to piecemeal or oversimplify. The issues are exceedingly complex.

Oh they are, I have no doubt about it.
1028  Alternate cryptocurrencies / Altcoin Discussion / Re: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin? on: April 15, 2016, 02:22:20 AM
A better compiler certainly could do better with sqrt() in some cases, especially with the flag I mentioned (and even without, given sufficient global analysis, but as I said how much of that to do is somewhat of a judgement call), but I'm just pointing out that the program you fed it was not as simple as it appeared, in terms of what you were asking for.

I'm pretty sure it would choke even if I asked it to do
b=b+1 or b*1
bb=bb+1...
bbb=bbb+1...
bbbb=bbbb+1...

Maybe I'll try it out...

I made a variant of the program that does 100mn loops of divisions... It was finishing too fast, so I put each division 5 times in every loop.

   b=b/g;  //g=2 so it halves every time
   b=b/g;
   b=b/g;
   b=b/g;
   b=b/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bb=bb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;
   bbbb=bbbb/g;

The Pascal compiler's binary was awfully slow: in this arrangement it took ~7s, while C at -O2/-O3 -march=nocona was at 1500ms.

When I rearranged them as

   b=b/g;  
   bb=bb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;
   b=b/g;  
   bb=bb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;
   b=b/g;  
   bb=bb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;
   b=b/g;  
   bb=bb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;
   b=b/g;  
   bb=bb/g;
   bbb=bbb/g;
   bbbb=bbbb/g;

...the Pascal compiler took the hint that it could do b, bb, bbb and bbbb together, and dropped to 6 secs.

GCC, on the other hand, was smart enough to see that the lines were not really dependent on each other, so it got on with the job - although it still used only SCALAR (SISD) SSE, not PACKED (=true SIMD).

I then tried multiplying the result instead of dividing it (multiply by 1/g, which is 0.5, so it's the same as /2). Multiplications are often much faster than divisions. Pascal went down to ~4s that way - which means their compiler sucks, because that conversion should be automatic; I shouldn't have to convert divs to multiplications by hand to get the job done faster. It's kind of elemental. GCC with C was unaffected: it was already converting the divisions to multiplications at -O2/-O3. Only at -O0 was it around 5-6 secs.

I then hardwired asm into the Pascal: first scalar multiplications, then packed multiplications - all SSE. Scalar took me down to 3.4s, while packed (actual SIMD use) took me to 2.6s.

Final code was like:

Code:
for i:= 1 to 100000000 do //100mn loop

   begin;

   asm   // THE PACKED WAY / SIMD doing 20 multiplications in 10 instructions= 2680ms

     movlpd xmm1, g      //the multiplier (value of 0.5) is loaded in xmm1 lower space
     movhpd xmm1, g      //the multiplier (value of 0.5) is loaded in xmm1 higher space
     movlpd xmm2, b      //b is loaded in xmm2 lower space
     movhpd xmm2, bb     //bb is loaded in xmm2 higher space
     movlpd xmm3, bbb    //bbb is loaded in xmm3 lower space
     movhpd xmm3, bbbb   //bbbb is loaded in xmm3 higher space
     MULPD xmm2, xmm1    //multiply b and bb residing on xmm2 with the multiplier of 0.5 that resides in xmm1
     MULPD xmm3, xmm1    //multiply bbb and bbbb residing on xmm3 with the multiplier of 0.5 that resides in xmm1
     MULPD xmm2, xmm1    //round 2
     MULPD xmm3, xmm1    //round 2
     MULPD xmm2, xmm1    //round 3
     MULPD xmm3, xmm1    //round 3
     MULPD xmm2, xmm1    //round 4
     MULPD xmm3, xmm1    //round 4
     MULPD xmm2, xmm1    //round 5
     MULPD xmm3, xmm1    //round 5
     movlpd b, xmm2      //returning results of b, from the lower part of xmm2, back to pascal's b variable
     movhpd bb, xmm2     //returning results of bb, from the higher part of xmm2, back to pascal's bb variable
     movlpd bbb, xmm3    //returning results of bbb, from the lower part of xmm3, back to pascal's bbb variable
     movhpd bbbb, xmm3   //returning results of bbbb, from the higher part of xmm3, back to pascal's bbbb variable

     end;  //end of asm block

   end;    //end of loop body

Most of the Pascal overhead that keeps it at 2.6s is not related to my code: the loop itself, doing zero calculations, costs 1.4s on its own, so there is definitely overhead there.

Anyway I went back to gcc and c to see what it's doing.

At -O3 it was generating MULSD (scalar SSE multiply, i.e. SISD fashion):

The 20 divisions had been converted to 20 separate scalar SSE multiply instructions - Single Instruction, Single Data. Again the compiler fails to pack the data and process it in batches; it's using 20 instructions where it could use 10.

Code:
Disassembly of section .text:

00000000004005a0 <main>:
  4005a0: 53                   push   %rbx
  4005a1: 48 83 ec 20           sub    $0x20,%rsp
  4005a5: bf 0a 00 00 00       mov    $0xa,%edi
  4005aa: e8 a1 ff ff ff       callq  400550 <putchar@plt>
  4005af: e8 ac ff ff ff       callq  400560 <clock@plt>
  4005b4: 48 89 c3             mov    %rax,%rbx
  4005b7: f2 0f 10 15 59 03 00 movsd  0x359(%rip),%xmm2        # 400918 <_IO_stdin_used+0x48>
  4005be: 00
  4005bf: f2 0f 10 05 59 03 00 movsd  0x359(%rip),%xmm0        # 400920 <_IO_stdin_used+0x50>
  4005c6: 00
  4005c7: f2 0f 10 1d 59 03 00 movsd  0x359(%rip),%xmm3        # 400928 <_IO_stdin_used+0x58>
  4005ce: 00
  4005cf: f2 0f 10 25 59 03 00 movsd  0x359(%rip),%xmm4        # 400930 <_IO_stdin_used+0x60>
  4005d6: 00
  4005d7: 31 c0                 xor    %eax,%eax
  4005d9: f2 0f 10 0d 57 03 00 movsd  0x357(%rip),%xmm1        # 400938 <_IO_stdin_used+0x68>
  4005e0: 00
  4005e1: f2 0f 10 2d 57 03 00 movsd  0x357(%rip),%xmm5        # 400940 <_IO_stdin_used+0x70>
  4005e8: 00
  4005e9: f2 44 0f 10 0d 56 03 movsd  0x356(%rip),%xmm9        # 400948 <_IO_stdin_used+0x78>
  4005f0: 00 00
  4005f2: f2 44 0f 10 05 55 03 movsd  0x355(%rip),%xmm8        # 400950 <_IO_stdin_used+0x80>
  4005f9: 00 00
  4005fb: f2 0f 10 3d 55 03 00 movsd  0x355(%rip),%xmm7        # 400958 <_IO_stdin_used+0x88>
  400602: 00
**400603: f2 0f 59 e1           mulsd  %xmm1,%xmm4
  400607: f2 0f 59 e1           mulsd  %xmm1,%xmm4
  40060b: f2 0f 59 e1           mulsd  %xmm1,%xmm4
  40060f: f2 0f 59 e1           mulsd  %xmm1,%xmm4
  400613: f2 0f 59 e1           mulsd  %xmm1,%xmm4
  400617: f2 0f 59 d9           mulsd  %xmm1,%xmm3
  40061b: f2 0f 59 d9           mulsd  %xmm1,%xmm3
  40061f: f2 0f 59 d9           mulsd  %xmm1,%xmm3
  400623: f2 0f 59 d9           mulsd  %xmm1,%xmm3
  400627: f2 0f 59 d9           mulsd  %xmm1,%xmm3
  40062b: f2 0f 59 c1           mulsd  %xmm1,%xmm0
  40062f: f2 0f 59 c1           mulsd  %xmm1,%xmm0
  400633: f2 0f 59 c1           mulsd  %xmm1,%xmm0
  400637: f2 0f 59 c1           mulsd  %xmm1,%xmm0
  40063b: f2 0f 59 c1           mulsd  %xmm1,%xmm0
  40063f: f2 0f 59 d1           mulsd  %xmm1,%xmm2
  400643: f2 0f 59 d1           mulsd  %xmm1,%xmm2
  400647: f2 0f 59 d1           mulsd  %xmm1,%xmm2
  40064b: f2 0f 59 d1           mulsd  %xmm1,%xmm2
  40064f: f2 0f 59 d1           mulsd  %xmm1,%xmm2
  400653: 66 0f 2e ec           ucomisd %xmm4,%xmm5
  400657: 76 11                 jbe    40066a <main+0xca>
  400659: 66 0f ef f6           pxor   %xmm6,%xmm6
  40065d: f2 0f 2a f0           cvtsi2sd %eax,%xmm6
  400661: f2 0f 58 e6           addsd  %xmm6,%xmm4
  400665: f2 41 0f 58 e1       addsd  %xmm9,%xmm4
  40066a: 66 0f 2e eb           ucomisd %xmm3,%xmm5
  40066e: 76 11                 jbe    400681 <main+0xe1>
  400670: 66 0f ef f6           pxor   %xmm6,%xmm6
  400674: f2 0f 2a f0           cvtsi2sd %eax,%xmm6
  400678: f2 0f 58 de           addsd  %xmm6,%xmm3
  40067c: f2 41 0f 58 d8       addsd  %xmm8,%xmm3
  400681: 66 0f 2e ea           ucomisd %xmm2,%xmm5
  400685: 76 10                 jbe    400697 <main+0xf7>
  400687: 66 0f ef f6           pxor   %xmm6,%xmm6
  40068b: f2 0f 2a f0           cvtsi2sd %eax,%xmm6
  40068f: f2 0f 58 d6           addsd  %xmm6,%xmm2
  400693: f2 0f 58 d7           addsd  %xmm7,%xmm2
  400697: 83 c0 01             add    $0x1,%eax
  40069a: 3d 00 e1 f5 05       cmp    $0x5f5e100,%eax
  40069f: 0f 85 5e ff ff ff     jne    400603 <main+0x63>
  4006a5: f2 0f 11 44 24 18     movsd  %xmm0,0x18(%rsp)
  4006ab: f2 0f 11 54 24 10     movsd  %xmm2,0x10(%rsp)
  4006b1: f2 0f 11 5c 24 08     movsd  %xmm3,0x8(%rsp)
  4006b7: f2 0f 11 24 24       movsd  %xmm4,(%rsp)
  4006bc: e8 9f fe ff ff       callq  400560 <clock@plt>
  4006c1: 48 29 d8             sub    %rbx,%rax
  4006c4: 66 0f ef c9           pxor   %xmm1,%xmm1
  4006c8: f2 48 0f 2a c8       cvtsi2sd %rax,%xmm1
  4006cd: f2 0f 5e 0d 8b 02 00 divsd  0x28b(%rip),%xmm1        # 400960 <_IO_stdin_used+0x90>
  4006d4: 00
  4006d5: f2 0f 59 0d 8b 02 00 mulsd  0x28b(%rip),%xmm1        # 400968 <_IO_stdin_used+0x98>
  4006dc: 00
  4006dd: 66 48 0f 7e cb       movq   %xmm1,%rbx
  4006e2: f2 0f 10 24 24       movsd  (%rsp),%xmm4
  4006e7: f2 0f 10 5c 24 08     movsd  0x8(%rsp),%xmm3
  4006ed: f2 0f 58 e3           addsd  %xmm3,%xmm4
  4006f1: f2 0f 10 44 24 18     movsd  0x18(%rsp),%xmm0
  4006f7: f2 0f 58 c4           addsd  %xmm4,%xmm0
  4006fb: f2 0f 10 54 24 10     movsd  0x10(%rsp),%xmm2
  400701: f2 0f 58 c2           addsd  %xmm2,%xmm0
  400705: bf d4 08 40 00       mov    $0x4008d4,%edi
  40070a: b8 01 00 00 00       mov    $0x1,%eax
  40070f: e8 5c fe ff ff       callq  400570 <printf@plt>
  400714: 66 48 0f 6e c3       movq   %rbx,%xmm0
  400719: bf ea 08 40 00       mov    $0x4008ea,%edi
  40071e: b8 01 00 00 00       mov    $0x1,%eax
  400723: e8 48 fe ff ff       callq  400570 <printf@plt>
  400728: f2 0f 10 05 40 02 00 movsd  0x240(%rip),%xmm0        # 400970 <_IO_stdin_used+0xa0>
  40072f: 00
  400730: 66 48 0f 6e fb       movq   %rbx,%xmm7
  400735: f2 0f 5e c7           divsd  %xmm7,%xmm0
  400739: bf 05 09 40 00       mov    $0x400905,%edi
  40073e: b8 01 00 00 00       mov    $0x1,%eax
  400743: e8 28 fe ff ff       callq  400570 <printf@plt>
  400748: 31 c0                 xor    %eax,%eax
  40074a: 48 83 c4 20           add    $0x20,%rsp
  40074e: 5b                   pop    %rbx
  40074f: c3                   retq  

At -Ofast, packed instructions finally make their appearance (time: 1.1s), but they come coupled with a few extra unsafe-math flags and semi-intentional loss of accuracy, which is problematic. The disassembly at that level shows some scalar muls and a lot of packed additions and packed moves. For some reason it breaks the divisions down not into 10 packed multiplications but into ~5 scalar ones plus a lot of extra packed additions.

Bottom line: everyone seems to have a lot of work to do to get the best out of our hardware. The FreePascal compiler lacks even the elementary logic of converting divs to muls, and has several slow parts.

As for C... the SSE stuff has been there for like 15 years. LOL. When are they gonna use it (properly*)? And what about AVX, AVX2, etc.? Should we wait till 2100? I bet they'll claim "we are taking advantage of AVX" while doing scalar (SISD) stuff there too - wasting the 256-bit / 512-bit width.

* One could argue that they are using SSE right now, but it's not that useful without exploiting the SIMD capability.


Quote

I did... The impression I get from all new language projects is that those with high targets often aim to be the next C/C++. The speaker's references to C++, and to how similar (but safer) his language is in many ways, confirm that this is what they have in the back of their minds: "Look, we are like C++ but much safer"... But as he points out at some point, if C++ evolves they might go bust.

I mean, what's your selling point? That your compiler notifies you? Is there anything preventing a C++ compiler from notifying the user that what he is doing is unsafe? If they wanted, they could issue a warning or even block compilation altogether on suspected unsafeness. It's doable. It's not a language issue, it's a compiler issue. There could be a compiler flag in C or C++, like --only-allow-safe-code, and suddenly you'd get 100 warnings on how to change your code or it won't compile.
1029  Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion on: April 14, 2016, 09:03:18 PM
Aztecminer, in the real world what you say makes sense, with the assumption that you are expecting to make money from the clients that will be using your infrastructure.

Someone is paying you $100 per month, you make $20 profit, and you are OK with the upgrade costs - otherwise you go out of business. Right?

You are not selling your services for near zero cost. If you did, and you had almost infinite demand as a result, needing upgrades to cope with near infinite demand and near infinite abuse (from the near zero cost situation), you would not upgrade anything. You'd just say "this is ridiculous, I'm going bust".

Your priority would definitely not be to service near-zero-cost users and abusers but to make it viable (=fee market).

If people say that your network or data center services "don't scale" and that the small guy who wanted your hosting services for 2 cents is "excluded" you'd tell them "cry me a river and fuck off".

Why do you want to have something different when bitcoin is concerned? Why should the priority of bitcoin be to service near-zero-cost users and abusers - and do so by upgrading constantly (=giving them more space to abuse, increasing the costs for everyone) with zero tangible benefits? Why do you want to turn the network into an economic amplification attack for those who service it? Why do you want to do with it what you wouldn't do for your own company?

There is a significant distinction between "this is an infrastructure that is used and paid and we must upgrade it" and "this is an infrastructure which is already abused due to the extremely low cost of use - so it doesn't make any sense to give, say, x100 space to the abusers".

Still, the abusers will have their near-zero-cost party, as the upgrades are coming soon and will "relieve" them of the enormous costs of sub 0.02$ fees that they are now using to spam the network Roll Eyes
1030  Alternate cryptocurrencies / Altcoin Discussion / Re: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin? on: April 14, 2016, 01:25:14 PM


Quoted for truth Grin
1031  Alternate cryptocurrencies / Altcoin Discussion / Re: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin? on: April 14, 2016, 01:23:58 PM
Quote
Apparently it didn't compile any better in this case, so that could just be the quality of the compiler, I'm not sure. I also don't remember if Pascal has ranges for floating point as it does for integers?

To get an idea: just running a 1-to-100mn loop in Pascal (with nothing to execute inside it) takes 1.4s on my Q8200 clocked at 1.75GHz. The C equivalent takes 400ms.

If you factor this in, then the math code may actually be faster in Pascal.
(Loop+math takes ~3.5-3.8s in C and ~4.5s in Pascal, but the loop itself is apparently inefficient - although I can't get much improvement by unrolling the code x2 and doing 50mn loops, so it's baffling.) The asm+Pascal-loop combo takes only 2.2s which, if the loop takes 1.4s by itself, means the math crunching consumes just 0.8s? lol?

As for the ranges, yes: http://wiki.freepascal.org/Variables_and_Data_Types


A better compiler certainly could do better with sqrt() in some cases, especially with the flag I mentioned (and even without, given sufficient global analysis, but as I said how much of that to do is somewhat of a judgement call), but I'm just pointing out that the program you fed it was not as simple as it appeared, in terms of what you were asking for.

I'm pretty sure it would choke even if I asked it to do
b=b+1 or b*1
bb=bb+1...
bbb=bbb+1...
bbbb=bbbb+1...

Maybe I'll try it out because since yesterday I'm trying something else with no success:

Quote
I'm not sure the deal with Pascal, I never use it.

1) I like the Turbo-pascal-like IDE of Free Pascal in the terminal. It's very productive to me - although I'm not producing much of anything  Grin
2) I like the structure, syntax, simplicity and power.
3) See for example how I embedded ASM with my preferred syntax (intel, instead of the more complex at&t). See the elegance. See the interactivity with the program variables without breaking my balls about anything.

I just dropped in a few lines as a replacement by

Code:
asm
     movlpd xmm1, b
     movhpd xmm1, bb
     SQRTPD xmm1, xmm1
     movlpd xmm2, bbb
     movhpd xmm2, bbbb
     SQRTPD xmm2, xmm2
     movlpd b, xmm1
     movhpd bb, xmm1
     movlpd bbb, xmm2
     movhpd bbbb, xmm2
 end;

...and IT WORKED. Like a boss.

Now, trying to do the same since yesterday with c:

Code:
//This replaces the c sqrts

     asm("movlpd xmm1, b");      
     asm("movhpd xmm1, bb");    
     asm("SQRTPD xmm1, xmm1");  
     asm("movlpd xmm2, bbb");
     asm("movhpd xmm2, bbbb");
     asm("SQRTPD xmm2, xmm2");
     asm("movlpd b, xmm1");
     asm("movhpd bb, xmm1");
     asm("movlpd bbb, xmm2");
     asm("movhpd bbbb, xmm2");

...and the result is:

gcc Math3asm.c -lm -masm=intel
/tmp/ccNTa80M.o: In function `main':

Math3asm.c:(.text+0x4f): undefined reference to `b'
Math3asm.c:(.text+0x5c): undefined reference to `bb'
Math3asm.c:(.text+0x69): undefined reference to `bbb'
Math3asm.c:(.text+0x76): undefined reference to `bbbb'
Math3asm.c:(.text+0x91): undefined reference to `b'
Math3asm.c:(.text+0x9a): undefined reference to `bb'
Math3asm.c:(.text+0xa7): undefined reference to `bbb'
Math3asm.c:(.text+0xb0): undefined reference to `bbbb'
Math3asm.c:(.text+0xbd): undefined reference to `b'
Math3asm.c:(.text+0xc6): undefined reference to `bb'
Math3asm.c:(.text+0xcf): undefined reference to `bbb'
Math3asm.c:(.text+0xd8): undefined reference to `bbbb'
collect2: error: ld returned 1 exit status

Ah fuck me with this bullshit. I google to find what's going on and I drop into this:

https://gcc.gnu.org/ml/gcc-help/2009-07/msg00044.html

...where a guy gets something similar... and here comes da bomb:

Quote

> Compilation passes - but the linker shouts: "undefined reference to `n'"
>
> What am I doing wrong? Shouldn't it be straight forward to translate
> these simple commands to Linux?


gcc inline assembler does not work like that.  You can't simply refer to
local variables in the assembler code.


L O L
1032  Alternate cryptocurrencies / Altcoin Discussion / Re: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin? on: April 14, 2016, 01:06:44 AM
EDIT: I added something to display the results at the end so it doesn't drop the entire loop, and while using the right compiler options improves things a bit, it is still generating ucomisd, which clearly indicates some sort of range/error/NaN checking. I didn't go through the code carefully to figure out what it is doing, but it suffices to say that sqrt() and 'asm SQRTPD' are not functionally equivalent.

If you write some code that doesn't pull in floating point minutiae (especially from library functions) you will often see actual vectorization.

Yeah, I can't really tell what it's doing either, but seeing 4x SIMD instructions for 4 variables, well, that's a "winner" right there for "FAIL". If the instruction count isn't lower than the number of data variables, you are doing it wrong. And that's not related to the various checks, btw.

It's just a straightforward translation of your source code with four separate sqrt() calls. It is using the SIMD instructions (in SISD mode) because they are faster than the FPU instructions, as you pointed out.

I'm just a "noob", but is it too much to have the audacious expectation that gcc will actually group things that can be grouped, so they are processed faster? I mean, I couldn't have made it any easier for the compiler: I ordered the operations one after another, with no other logic steps in between that would make the compiler question whether grouping is safe (in case other stuff depended on a "sequential" result). Sequential but independent = safe.
1033  Alternate cryptocurrencies / Altcoin Discussion / Re: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin? on: April 14, 2016, 01:01:16 AM
EDIT: I added something to display the results at the end so it doesn't drop the entire loop, and while using the right compiler options improves things a bit, it is still generating ucomisd, which clearly indicates some sort of range/error/NaN checking. I didn't go through the code carefully to figure out what it is doing, but it suffices to say that sqrt() and 'asm SQRTPD' are not functionally equivalent.

If you write some code that doesn't pull in floating point minutiae (especially from library functions) you will often see actual vectorization.

Yeah, I can't really tell what it's doing either, but seeing 4x SIMD instructions for 4 variables, well, that's a "winner" right there for "FAIL". If the instruction count isn't lower than the number of data variables, you are doing it wrong. And that's not related to the various checks, btw.

1034  Alternate cryptocurrencies / Altcoin Discussion / Re: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin? on: April 14, 2016, 12:44:33 AM
1. The result is ok, so no problem there with code behavior.

2. The sqrt is already giving SSE code. I could see it in the disassembler. The problem is that it is not in line with the SIMD spirit. Meaning that the whole concept is of a *single* instruction processing *multiple* data.

If I have 4 x sqrt code and 4 x instructions reaching the CPU, then where is the SIMD? It's 4 instructions doing 4 pieces of data. That's, well, Single Instruction Single Data... and on a 128-bit register, using just 64 bits of its width.

When the disassembler shows you 4 SIMD instructions where there should be 2 (because the variables are 64-bit), you know it's all fucked up right there. I could use the x87 unit as well. Actually I did, out of curiosity. It was slower than SSE. Apparently the SSE unit is better at this.

// The x87 way / 5150ms
//
//   fld  b
//   fsqrt
//   fstp  b
//   fld bb
//   fsqrt
//   fstp bb
//   fld bbb
//   fsqrt
//   fstp bbb
//   fld bbbb
//   fsqrt
//   fstp bbbb

...so back to SSE for doing it right (2 instructions, processing 2 pieces of data each). If I were using single precision I could do it with 1 instruction processing 4 pieces of data at once (a 128-bit register fits 4x32-bit).

3. My C equivalent code *was* using the C math library - which should be fast, right? Still, very slow at ~3.8s with a normal -O2 build and at best 3.5s after thorough tampering.
1035  Alternate cryptocurrencies / Altcoin Discussion / Re: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin? on: April 13, 2016, 11:58:06 PM
I have a logic test in the end (not displayed here) to always check the numbers for correctness.

It goes like:

   Writeln();
   Write('Final number: ',b+bb+bbb+bbbb:0:22,'    ');
   if (b+bb+bbb+bbbb) > 4.0000032938759028 then Write('Result [INCORRECT - 4.0000032938759027 expected]');
   if (b+bb+bbb+bbbb) < 4.0000032938759026 then Write('Result [INCORRECT- 4.0000032938759027 expected]');


...

anyway the source for c is:

Code:
#include <math.h>     
#include <stdio.h>    
#include <time.h>
 
int main()
{
printf("\n");

const double a = 911798473;
const double aa = 143314345;
const double aaa = 531432117;
const double aaaa = 343211418;
    
unsigned int i;
double score;

double b;
double bb;
double bbb;
double bbbb;

b = a;
bb = aa;
bbb = aaa;
bbbb = aaaa;

double total_time;
clock_t start, end;
 
start = clock();
 
 for (i = 0; i <100000000; i++)
 {
   b=sqrt (b);
   bb=sqrt(bb);
   bbb=sqrt(bbb);
   bbbb=sqrt(bbbb);
   if (b    <= 1.0000001)  {b=b+i+12.432432432;}
   if (bb   <= 1.0000001)  {bb=bb+i+15.4324442;}
   if (bbb  <= 1.0000001)  {bbb=bbb+i+19.42884;}
   if (bbbb <= 1.0000001)  {bbbb=bbbb+i+34.481;}
  }

 end = clock();

 total_time = ((double) (end - start)) / CLOCKS_PER_SEC * 1000;
  
 score = (10000000 / total_time);
 
 printf("\nTime elapsed: %0.0f msecs", total_time);  
 printf("\nScore: %0.0f\n", score);
 
 return 0;
}


And pascal/asm (freepascal / 64 / linux) - including the logic test.

Code:
{$ASMMODE intel}
Uses sysutils;

Const //some randomly chosen constants to begin math functions
a: double = 911798473;
aa: double = 143314345;
aaa: double = 531432117;
aaaa: double = 343211418;

Var
b,bb,bbb,bbbb: double; //variables that will be used for storing square roots
time1,score: single;   //how much time the program took, and what the benchmark score is
i: longword;           //loop counter

Begin
Writeln();             //just printing an empty line

b:=a;                  //begin to assign some large values in order to start finding square roots
bb:=aa;
bbb:=aaa;
bbbb:=aaaa;

sleep(100); // a 100ms delay before we start the timer, so that any I/O has stopped

time1:= GetTickCount64();

for i:= 1 to 100000000 do //100mn loop

   begin;
   asm
     movlpd xmm1, b      //loading the first variable "b" to the lower part of xmm1
     movhpd xmm1, bb     //loading the second variable "bb" to the higher part of xmm1
     SQRTPD xmm1, xmm1   //batch processing both variables for their square root, in the same register, with one SIMD command
     movlpd xmm2, bbb    //loading the third variable "bbb" to the lower part of xmm2
     movhpd xmm2, bbbb   //loading the fourth variable "bbbb" to the higher part of xmm2
     SQRTPD xmm2, xmm2   //batch processing their square roots
     movlpd b, xmm1      //
     movhpd bb, xmm1     // Returning all results from the register back to memory (the Pascal program variables)
     movlpd bbb, xmm2    //
     movhpd bbbb, xmm2   //
    end;

{  b:=sqrt(b);           // This entire part was replaced with the asm above.
   bb:=sqrt(bb);         // In my machine this code gives me ~4530ms while the asm above gives 2240ms.
   bbb:=sqrt(bbb);       //
   bbbb:=sqrt(bbbb);}    //

   if b    <= 1.0000001  then b:=b+i+12.432432432; // increase b/bb/bbb/bbb back to higher values by
   if bb   <= 1.0000001  then bb:=bb+i+15.4324442; // adding integers and decimals on them, in order
   if bbb  <= 1.0000001  then bbb:=bbb+i+19.42884; // to keep the variables large and continue the
   if bbbb <= 1.0000001  then bbbb:=bbbb+i+34.481; // process of finding square roots, instead of the variables going to "1"
                                                   // due to finite decimal precision.
   end;

time1:= GetTickCount64() - time1;
score:= 10000000 / time1;   // Just a way to give a "score" instead of just time elapsed.
                            // Baseline calibration is at 1000 points rewarded for 10000ms delay...
                            // In other words if you finish 5 times faster, say 2000ms, you get 5000 points.

Writeln();
Write('Final number: ',b+bb+bbb+bbbb:0:22,'    ');
if (b+bb+bbb+bbbb) > 4.0000032938759028 then Write('Result [INCORRECT - 4.0000032938759027 expected]'); //checking result
if (b+bb+bbb+bbbb) < 4.0000032938759026 then Write('Result [INCORRECT- 4.0000032938759027 expected]'); //checking result

Writeln();
Writeln('Time elapsed: ',time1:0:0,' msecs.'); // Time elapsed announced to the user
Writeln('Score: ', FloatToStr(round(score)));  // Score announced to the user
End.
1036  Economy / Economics / Re: Bitcoin or gold? on: April 13, 2016, 11:07:13 PM
What? 200-400 ounces per ton from bottles? And gold being ...synthesized because there are near zero trace amounts in the raw material? lol?
1037  Alternate cryptocurrencies / Altcoin Discussion / Re: [neㄘcash, ᨇcash, net⚷eys, or viᖚes?] Name AnonyMint's vapor coin? on: April 13, 2016, 10:50:15 PM
After our recent discussion, I made a small program that calculates square roots, for 100mn loops (x4 = finding 400mn square roots). When a value tends back toward 1, the program adds to it so that it can keep taking square roots.

I started this to see the performance difference between Pascal and C (which I avoid like the plague, but anyway) in terms of the binaries produced (=compiler performance), but then I expanded the experiment to see what is wrong with their SSE use.

The code for Pascal, C and ASM (inside the pascal window) here =>
http://s23.postimg.org/j74spnqc9/wastingtimewithbenchmarks.jpg

So, after fiddling with all the available optimizations, Pascal gave me ~4.5s. Interestingly, the disassembly (via objdump) shows that it uses SSE instructions like SQRTPD, but it does so in a weird way.

C, with GCC 5.3.x, gave me 3.5-3.9s. Paradoxically, it liked lower -O settings, like -O0... -O1 lost speed (3.8s) and -O2 / -O3 tried to regain it. I also got more performance with -mtune=nocona than with -mtune=core2, which is architecturally closer to my Q8200 and is what -march=native selects automatically. I also tried -msse2/-msse3/-mssse3/-msse4.1 etc., -mfpmath in all combos, etc.; at best it got down to 3.55s.

The objdump of the gcc binary didn't enlighten me much, but I could see it using the sqrtsd instruction 4 times:

The source is:

 for (i = 0; i <100000000; i++)
 {
   b=sqrt (b);
   bb=sqrt(bb);
   bbb=sqrt(bbb);
   bbbb=sqrt(bbbb);


and the dump is:

40072e:   0f 84 9b 00 00 00       je     4007cf <main+0x12f>
  400734:   f2 0f 51 d6             sqrtsd %xmm6,%xmm2
  400738:   66 0f 2e d2             ucomisd %xmm2,%xmm2
  40073c:   0f 8a 63 02 00 00       jp     4009a5 <main+0x305>
  400742:   66 0f 28 f2             movapd %xmm2,%xmm6
  400746:   f2 0f 51 cd             sqrtsd %xmm5,%xmm1
  40074a:   66 0f 2e c9             ucomisd %xmm1,%xmm1
  40074e:   0f 8a d9 01 00 00       jp     40092d <main+0x28d>
  400754:   66 0f 28 e9             movapd %xmm1,%xmm5
  400758:   f2 0f 51 c7             sqrtsd %xmm7,%xmm0
  40075c:   66 0f 2e c0             ucomisd %xmm0,%xmm0
  400760:   0f 8a 47 01 00 00       jp     4008ad <main+0x20d>
  400766:   66 0f 28 f8             movapd %xmm0,%xmm7
  40076a:   f2 0f 51 c3             sqrtsd %xmm3,%xmm0
  40076e:   66 0f 2e c0             ucomisd %xmm0,%xmm0
  400772:   0f 8a b5 00 00 00       jp     40082d <main+0x18d>

...whereas proper SSE use would load two values into the same register and batch-process them (=2 instructions x 2 pieces of data on the same registers).

So, I went back to Pascal, which I like better for the Turbo Pascal-like IDE in the console, and changed the code over there from:

for i:= 1 to 100000000 do

b:=sqrt(b);  
bb:=sqrt(bb);        
bbb:=sqrt(bbb);    
bbbb:=sqrt(bbbb);


...to

for i:= 1 to 100000000 do //100mn loop
   begin;
     movlpd xmm1, b      //loading the first variable "b" to the lower part of xmm1
     movhpd xmm1, bb     //loading the second variable "bb" to the higher part of xmm1
     SQRTPD xmm1, xmm1   //batch processing both variables for their square root, with one SIMD command
     movlpd xmm2, bbb    //loading the third variable "bbb" to the lower part of xmm2
     movhpd xmm2, bbbb   //loading the fourth variable "bbbb" to the higher part of xmm2
     SQRTPD xmm2, xmm2   //batch processing their square roots
     movlpd b, xmm1      //
     movhpd bb, xmm1     // Returning all results from the register back to the Pascal variables
     movlpd bbb, xmm2    //
     movhpd bbbb, xmm2   //


...and voila, my times went down to 2.2s

So: Pascal ~4.5s, C ~3.6s, and Pascal with simple, rational SSE use - by someone who is not even a coder and just RTFMs what the SSE instructions do - 2.2s.

Ladies and gentlemen it is official. Our language compilers SUCK BALLS.

I had the 4 variable-assignment / sqrt lines lined up one after another, so it was made extremely easy for the compiler to batch-process them with SSE. I even issued a #pragma directive to gcc to force it, and it did nothing.

No, the compilers "know better". ...And that's how "C is a fast language" goes down the drain. With a simple -O2 compilation it sits at 3.8s (by "using" SSE - or, more precisely, misusing it) vs my 2.2s of manual tampering in Pascal. So C ends up ~70% slower even when faced with almost ideally arranged source code that it could exploit.

(side by side Pascal / C / ASM inside Pascal): http://s23.postimg.org/j74spnqc9/wastingtimewithbenchmarks.jpg
1038  Alternate cryptocurrencies / Announcements (Altcoins) / Re: [ANN][DASH] Dash | First Anonymous Coin | Inventor of X11, DGW, Darksend and InstantX on: April 13, 2016, 05:59:57 PM

My bad. I left the browser tab open from yesterday, but as soon as I refreshed I see your number.

Well, it's up to 7400 now just in the last few minutes, so it does seem to be pretty logjammed.

Broadcasting txs is free... so it could be 100 million for the lolz (without a strict mempool size).

What matters is whether these txs will ever be included - and if they are paying peanuts, or nothing, they shouldn't.
1039  Economy / Economics / Re: Bitcoin or gold? on: April 13, 2016, 05:39:43 PM
Gold can be made from brown beer bottle glass in a microwave. People have been doing it for years now. The electrons change the glass into gold and other precious metals. Some companies and governments probably makes tons of gold this way secretly. Then they tell you it's rare which is a lie. Precious metals are a scam.

Bitcoin will also fail in a few years IMO.

Retard.

Glass is made from silica, gold is an element.

Every day the movie Idiocracy becomes closer to reality.

Bet you also think we live on a flat earth too  Grin

What he's saying is partially true.

Microwaving helps separate fine silica from the gold trapped in the sand (the raw material) that was used. Gold is *everywhere* around us. Every soil or sand, even ocean water, contains some tiny amount of gold. The problem is that it's in the parts-per-billion range. Some sands have higher content and, if there were a method of separation, it would be possible to extract the gold from the glass. And it just so happens that microwaving can do it, under certain circumstances.

I am not aware of the cost vs benefit ratios though. The cost for me, for example, would be something like

-lost revenue per bottle that could be recycled (I think recyclers over here pay ~0.10 euro or something, per beer bottle - which would be the equivalent of extracting 0.003grams of gold per bottle)
-money on industrial mw ovens
-money on electricity
-money on handling tools and graphite casting equipment or similar, which tend to break due to the glassification of the (remelted) sand
-disposal costs of amorphous molten glass (?)

...etc... So is it worth it? Who knows. But the price they pay per bottle is definitely "fishy". I've often wondered why they pay so much for recycling glass beer bottles.

edit: And I just noticed you have an avatar of an astronaut drinking from a beer bottle Tongue
1040  Bitcoin / Bitcoin Discussion / Re: Ever dreamed about Bitcoin? on: April 13, 2016, 04:35:51 PM
It is interesting that in this thread we can see that people have 2 different interpretations of the word dream and are communicating ...in parallel. Some refer to fantasizing / day-dreaming, others to sleep dreaming...
