If you look at the basic C operations, pretty much every one corresponds to a single CPU instruction (or a very small number of them) from the 70s. As TPTB said, it was intended to be a thin, somewhat higher-level abstraction that stayed close to the hardware. Original C wasn't even that portable, in the sense that you didn't have things like fixed-size integer types. To get something close to that today, you have to consider intrinsics for new instructions (which didn't exist in the 70s) as part of the language.
The original design of C never included all these highly aggressive optimizations that compilers attempt today; that was all added later. Back in the day, optimizers were largely confined to the realm of FORTRAN. They succeed in some cases for C, of course, but it's a bit of a square peg in a round hole.
The optimizations added later were made possible by more complex code that could be simplified (=unnecessary complexity getting scaled down), plus the increase in usable RAM. For example, if an unrolled loop was faster than a rolled one, or inlining worked better but cost you more memory (which you now had), it was worth it. But only up to an extent, because now we have the performance walls of L1 and L2 cache sizes, which are roughly the main-RAM sizes of the 1970s.

But in terms of instruction set utilization it's a clusterfuck. In a way we don't even need superoptimizers when we have PGO for things that have been sampled running in a predetermined way. You allow the compiler to see PRECISELY what the program does. No "ifs" or "hows". It KNOWS what the program does. It KNOWS the logic and flow. It saw it running.

You (as a compiler) are allowed to see the executable doing

b=b/g
bb=bb/g
bbb=bbb/g
bbbb=bbbb/g

...and you now know that you can pack these 4 into two SIMD instructions. You didn't even have to see it running: you already knew these were different variables, with different outcomes, aligned to the correct size. But even if you had any doubts, you saw them running anyway with -fprofile-generate. And still you are not packing these fuckers together after -fprofile-use. That's the point I'm furious about. It's just a few simple "if then else" checks in the heuristics: IF you see instructions that can be packed, THEN FUCKING PACK THEM instead of issuing serial scalar (SISD) instructions. With AVX the loss is not 2-4x but 4-8x. It's insane.

You don't need to know much about compilers to understand that their optimizations suck. You just see the epic fail that their PGO is, and you know how bad their heuristics are, when they can't tell what can be optimized even while knowing full well the flow, logic, speed, bottlenecks, etc. of the program.
I'm kind of repeating myself for emphasis, but we need to realize that at the point where the profiler knows what the program does, there are no excuses left of the type "but, but, but I don't know if that optimization is safe, so I can't risk it". No, now you know. With 100% certainty. (Not that packing 2 into 1 was risky in the first place.)
|
|
|
Yes, the streaming argument is valid, but the processor is capable of more than that.
Compilers are not superoptimizers. They can't, and don't promise to, do everything a processor is capable of.

Basically that brings us back to the starting point... When C was first created, it promised to be very fast and suitable for writing OSes, etc. Meaning, its compiler wasn't leaving much performance on the table. With kHz of speed and a few KB of memory, there was no room for inefficiency. Granted, the instruction set has expanded greatly since the 70s with FPUs (x87), MMX, SSE(x), AVX(x), AES-NI, etc., but that was the promise: to keep the result close to zero overhead compared to asm. That's what C promised to be. But that has gone out the window as the compilers failed to match the progress and expansion of the CPU's arsenal of tools. We are 15 years after SSE2 and we are still discussing why the hell it isn't using SSE2 in a packed manner. This isn't normal by my standards.

Maybe, maybe not. It just apparently hasn't been a priority in the development of GCC. Have you tried icc to see if it does better, for example? (I don't know the answer.)
Yes, it's somewhat better, but not what I expected. That was in version 12, IIRC; now it's at 15 or 16, again IIRC. I've actually used clang, icc, and AMD Open64 - they don't have any serious differences. In some apps or cracking stuff they might; I've seen icc excel in some crypto stuff.

It is quite possible that an effort to improve optimization in GCC for, say, cryptographic algorithms would bear fruit. Whether that would be accepted into the compiler, given its overall tradeoffs, I don't know.
We need a better GCC in general. But that's easy to ask when someone else has to code it.
|
|
|
Yes, the streaming argument is valid, but the processor is capable of more than that.
I guess I'm asking too much when I want the compiler to group similar but separate/non-linear (=safe) operations into a SIMD instruction.
The comparison is not against some compiler that exists in fantasy land, but against real-life asm improvements.
How many cryptographic functions, pieces of mining software, etc. aren't hand-tweaked for much greater performance? Why? Because the compilers don't do what they ought to do.
For example, what's the greatest speed of an SHA-256 implementation in C, and what's the equivalent in C+asm? I'd guess at least twice as fast for the latter.
|
|
|
Yeah, the golden brackets of SIMDs... compilers love those, don't they? But they are rarely used if one isn't using arrays.
If my loop was

for 1 to 500mn loops do
  b=b/g
  bb=bb/g
  bbb=bbb/g
  bbbb=bbbb/g

...it wouldn't use any packed instructions.
Btw, if I made it 100mn loops x 4 math operations, as the original spec intended (I did 4 ops x 5 repetitions of the math in every loop to compensate for fast finishing speeds - but apparently I won't be using those now with values like 2.432094328043280942, as it goes up to 20+ secs instead of 2), then I'd have to manually unroll the loop, lower the loop count, and create arrays. Why? Because without those golden brackets the compiler is useless. You have to write, not as you want to write, but as the compiler wants you to write.
|
|
|
I shouldn't have to manually convert divs to multiplications to get the job done faster. It's kind of elemental.

No it isn't elemental, and it isn't even a valid optimization (without sacrificing accuracy via -funsafe-math-optimizations, etc.).

I will insist on that. It is elemental. GCC has the same behavior (converting divs => muls) even at very low optimization levels, because the results are the same. Even at -O1 or -O2. This is *not* reserved for higher-level, unsafe optimizations.

b=b/g; b=b/g; b=b/g; b=b/g; b=b/g;
bb=bb/g; bb=bb/g; bb=bb/g; bb=bb/g; bb=bb/g;
bbb=bbb/g; bbb=bbb/g; bbb=bbb/g; bbb=bbb/g; bbb=bbb/g;
bbbb=bbbb/g; bbbb=bbbb/g; bbbb=bbbb/g; bbbb=bbbb/g; bbbb=bbbb/g;

=> -O0 (no optimization) output like (divsd = scalar SSE division):

400744: f2 0f 5e 45 b0    divsd  xmm0,QWORD PTR [rbp-0x50]
400749: f2 0f 11 45 f0    movsd  QWORD PTR [rbp-0x10],xmm0
40074e: f2 0f 10 45 f0    movsd  xmm0,QWORD PTR [rbp-0x10]
400753: f2 0f 5e 45 b0    divsd  xmm0,QWORD PTR [rbp-0x50]
400758: f2 0f 11 45 f0    movsd  QWORD PTR [rbp-0x10],xmm0
40075d: f2 0f 10 45 f0    movsd  xmm0,QWORD PTR [rbp-0x10]
400762: f2 0f 5e 45 b0    divsd  xmm0,QWORD PTR [rbp-0x50]
400767: f2 0f 11 45 f0    movsd  QWORD PTR [rbp-0x10],xmm0
40076c: f2 0f 10 45 f0    movsd  xmm0,QWORD PTR [rbp-0x10]

=> -O1 (low optimization) output like:

400728: f2 0f 10 44 24 20    movsd  xmm0,QWORD PTR [rsp+0x20]
40072e: f2 0f 59 c1          mulsd  xmm0,xmm1
400732: f2 0f 59 c1          mulsd  xmm0,xmm1
400736: f2 0f 59 c1          mulsd  xmm0,xmm1
40073a: f2 0f 59 c1          mulsd  xmm0,xmm1
40073e: f2 0f 59 c1          mulsd  xmm0,xmm1
400742: 66 44 0f 28 d0       movapd xmm10,xmm0
400747: f2 0f 11 44 24 20    movsd  QWORD PTR [rsp+0x20],xmm0
40074d: f2 0f 10 44 24 08    movsd  xmm0,QWORD PTR [rsp+0x8]
400753: f2 0f 59 c1          mulsd  xmm0,xmm1
400757: f2 0f 59 c1          mulsd  xmm0,xmm1
40075b: f2 0f 59 c1          mulsd  xmm0,xmm1

=> -O2 and -O3: more of the same, but 20x scalar muls one after the other (and probably intentionally avoiding xmm0, which in my experience is slower):

40060f: 90             nop
400610: f2 0f 59 e9    mulsd  xmm5,xmm1
400614: f2 0f 59 e1    mulsd  xmm4,xmm1
400618: f2 0f 59 d9    mulsd  xmm3,xmm1
40061c: f2 0f 59 d1    mulsd  xmm2,xmm1
400620: f2 0f 59 e9    mulsd  xmm5,xmm1
400624: f2 0f 59 e1    mulsd  xmm4,xmm1
400628: f2 0f 59 d9    mulsd  xmm3,xmm1
40062c: f2 0f 59 d1    mulsd  xmm2,xmm1
400630: f2 0f 59 e9    mulsd  xmm5,xmm1
400634: f2 0f 59 e1    mulsd  xmm4,xmm1
400638: f2 0f 59 d9    mulsd  xmm3,xmm1
40063c: f2 0f 59 d1    mulsd  xmm2,xmm1
400640: f2 0f 59 e9    mulsd  xmm5,xmm1
400644: f2 0f 59 e1    mulsd  xmm4,xmm1
400648: f2 0f 59 d9    mulsd  xmm3,xmm1
40064c: f2 0f 59 d1    mulsd  xmm2,xmm1
400650: f2 0f 59 e9    mulsd  xmm5,xmm1
400654: f2 0f 59 e1    mulsd  xmm4,xmm1
400658: f2 0f 59 d9    mulsd  xmm3,xmm1
40065c: f2 0f 59 d1    mulsd  xmm2,xmm1
400660: 66 0f 2e f5    ucomisd xmm6,xmm5

And finally at -Ofast you get ~5 scalar multiplications, and the rest are multiplications broken down to ...additions (scalar and packed). Because adds are faster than muls... -Ofast can break down accuracy; I don't know if it's related to the muls => adds, I'll have to analyze that at some point. But divs => muls is definitely safe. They may not do it with the proposed 1/n (or it could be conditional on n not being zero), but they do it. We are running such code daily if our binaries are compiled with anything above -O0 (and they are).

edit: I tried a variation where instead of /g (g=2), which becomes *0.5, I put a g value like 2.432985742898957284979048059480928509285309285290853029850235942... Now at that level, up to -O3 it uses divs, and only at the -Ofast level does it turn them into muls and packed adds. So the difference between the gcc and freepascal compilers is that gcc can easily figure out where you can turn divs into muls safely (when high precision is not affected), while freepascal just doesn't do it at all, even with safe values that don't affect accuracy, like /2 (=> *0.5).
|
|
|
Did the low-level optimizations violate any of the invariants of the C or C++ code? See, the problem is that these languages have corner cases with insufficient invariants, and thus you don't get what you thought the invariants were.
-Ofast, probably... But -fprofile-generate on top of -Ofast shouldn't give me 500ms instead of 1200 (~2.5x speedup). That's totally absurd. It's just running a profiler (with OVERHEAD) to monitor how the flow goes, writing the profile to disk, and then using that profile for the next compilation run. It's not supposed to go faster. It never goes faster with a profiler.

That is entirely the wrong conceptualization. The semantics of the C++ code are captured by the type system.
Which, again, is a spec in a book. It's theory. Compilation and asm output are all about the compiler, no?
|
|
|
Elegance and comprehensibility via holistic unification of design concepts.

You basically have to know the C compiler's source code now to know what it will do. The 1000+ pages of specification are a clusterfuck.
Actually, in order to understand what the compiler will try to do, you must first have a good grasp of another few thousand pages: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

...and even then, practice will destroy theoretical behavior. I'm seeing this over and over and over. The counterintuitive functionality that results from asm (weird hardware behavior) or compiler parameters (weird translation to asm) is mind-boggling.

I ran into a very weird bug this morning, and I'm still scratching my head. GCC has an argument, -fprofile-generate, that lets you run the program in real-time profiling mode; the program saves relevant information about how it runs to a disk file, and then you recompile with -fprofile-use. With -fprofile-use, GCC reads the disk file, sees how execution of the profile-test binary went, and re-codes the binary (after understanding the logic and knowing what it has to do better) to perform better.

So I have this small benchmark that does 100mn loops of 20 divisions by 2. Periodically it bumps the values back up so that it continues to have something to divide by 2. I time this and look at the results.

#include <math.h>
#include <stdio.h>
#include <time.h>

int main()
{
    printf("\n");
    //initial randomly assigned values to start halving
    const double a = 3333333.3456743289;
    const double aa = 4444555.444334244;
    const double aaa = 6666777.66666666;
    const double aaaa = 32769999.123458;

    unsigned int i;
    double score;
    double g; //the number used for the divisions, so essentially halving everything each round
    double b; double bb; double bbb; double bbbb;

    g = 2;
    b = a; bb = aa; bbb = aaa; bbbb = aaaa;

    double total_time;
    clock_t start, end;
    start = 0; end = 0; score = 0;

    start = clock();
    for (i = 1; i < 100000001; i++) {
        b=b/g; b=b/g; b=b/g; b=b/g; b=b/g;
        bb=bb/g; bb=bb/g; bb=bb/g; bb=bb/g; bb=bb/g;
        bbb=bbb/g; bbb=bbb/g; bbb=bbb/g; bbb=bbb/g; bbb=bbb/g;
        bbbb=bbbb/g; bbbb=bbbb/g; bbbb=bbbb/g; bbbb=bbbb/g; bbbb=bbbb/g;
        if (b < 1.0000001) {b=b+i+12.432432432;}    //just adding more stuff in order for the numbers
        if (bb < 1.0000001) {bb=bb+i+15.4324442;}   //to return back to larger values after several
        if (bbb < 1.0000001) {bbb=bbb+i+19.42884;}  //rounds of halving
        if (bbbb < 1.0000001) {bbbb=bbbb+i+34.481;}
    }
    end = clock();

    total_time = ((double) (end - start)) / CLOCKS_PER_SEC * 1000;
    score = (10000000 / total_time);
    printf("\nFinal number: %0.20f", (b+bb+bbb+bbbb));
    printf("\nTime elapsed: %0.0f msecs", total_time);
    printf("\nScore: %0.0f\n", score);
    return 0;
}
(executed on a quad Q8200 @ 1.75GHz underclock)

gcc Maths4asm.c -lm -O0 => 6224ms
gcc Maths4asm.c -lm -O2 and -O3 => 1527ms
gcc Maths4asm.c -lm -Ofast => 1227ms
gcc Maths4asm.c -lm -Ofast -march=nocona => 1236ms
gcc Maths4asm.c -lm -Ofast -march=core2 => 1197ms (I have a Core quad; technically it's core2 arch)
gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-generate => 624ms
gcc Maths4asm.c -lm -Ofast -march=nocona -fprofile-generate => 530ms
gcc Maths4asm.c -lm -Ofast -march=nocona -fprofile-use => 1258ms (slower than without PGO, slower than -fprofile-generate)
gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-use => 1222ms (slower than without PGO, slower than -fprofile-generate)

So PGO optimization made it worse (L O L), but the most mind-blowing thing is that running the profiler got execution times down to 530ms. The profiler run (-generate) should normally take this to 4000-5000ms or above, as it monitors the process to create a log file. I have never run into a -fprofile-generate build that wasn't at least 2-3 times slower than a normal build - let alone 2-3 times faster. This is totally absurd. And then, to top it all, -fprofile-use (using the logfile to create the best binary) created worse binaries. Oh, and "nocona" (pentium4+) suddenly became ...the better architecture instead of core2. This stuff is almost unbelievable. I thought initially that the profiler must be activating multithreading, but no: I scripted simultaneous use of 4 runs and they all gave the same time - meaning there was no extra CPU use in other threads.

Add more elegant syntax and less noisy implementation of polymorphism, first-class functions, etc. Then fix corner cases (e.g. 'const') where the C++ compiler can't give you the correct warning nor (as explained in the video) enforce the programmer's intended semantics at the LLVM optimization layer.
There is no "can't give you". It's simply not programmed to give you. It can be programmed to do whatever you want it to do: from compiling only safe code, to giving you the correct warning, to, to, to. You quoted something similar in the post above on how C++ is being "upgraded" to do such stuff. The language is just a syntax written in a book for programmers ("this is how you'll code the X language"), plus the text file that the coder writes. But it's all happening in the compiler, really. If you have a great compiler => "wow, the X language is fast". If you have a very verbose compiler that can actually help you code better => "wow, the X language is excellent in warnings and helping you code"... If the compiler can switch 'safe' and 'unsafe' execution styles on and off => "wow, the X language is very flexible", etc. etc. A language, ultimately, is as good as its compiler - in terms of features.

Syntax and structure are different issues, and I generally prefer simple-to-read (or write) code instead of high levels of abstraction. It's not that I'm bad at abstract thinking. It's simply more time-consuming for me to start searching multiple files to see what each thing does, then follow references to other parts of the code, etc. etc. How is that supposed to be readable?

AlexGR, I think you would be well served by taking a compiler design course covering the implementation of both a low-level imperative language and a high-level functional programming language. That would start to help you see all the variables involved that you are trying to piecemeal or oversimplify. The issues are exceedingly complex.
Oh they are, I have no doubt about it.
|
|
|
A better compiler certainly could do better with sqrt() in some cases, especially with the flag I mentioned (and even without, given sufficient global analysis, but as I said how much of that to do is somewhat of a judgement call), but I'm just pointing out that the program you fed it was not as simple as it appeared, in terms of what you were asking for.
I'm pretty sure it would choke even if I asked it to do b=b+1 or b*1, bb=bb+1... bbb=bbb+1... bbbb=bbbb+1... Maybe I'll try it out...

I made a variant of the program that does 100mn loops of divisions... It was finishing too fast, so I put the divisions 5 times in each loop:

b=b/g; //g=2 so it halves every time
b=b/g;
b=b/g;
b=b/g;
b=b/g;
bb=bb/g;
bb=bb/g;
bb=bb/g;
bb=bb/g;
bb=bb/g;
bbb=bbb/g;
bbb=bbb/g;
bbb=bbb/g;
bbb=bbb/g;
bbb=bbb/g;
bbbb=bbbb/g;
bbbb=bbbb/g;
bbbb=bbbb/g;
bbbb=bbbb/g;
bbbb=bbbb/g;

The Pascal compiler's binary was awfully slow... in such an arrangement it took ~7s, while C, at -O2/-O3 -march=nocona, was at 1500ms. When I rearranged them as

b=b/g; bb=bb/g; bbb=bbb/g; bbbb=bbbb/g;
b=b/g; bb=bb/g; bbb=bbb/g; bbbb=bbbb/g;
b=b/g; bb=bb/g; bbb=bbb/g; bbbb=bbbb/g;
b=b/g; bb=bb/g; bbb=bbb/g; bbbb=bbbb/g;
b=b/g; bb=bb/g; bbb=bbb/g; bbbb=bbbb/g;

...the Pascal compiler took the hint that it could do b, bb, bbb, bbbb together, and dropped down to 6 secs. GCC, on the other hand, was smart enough to understand that the lines were not very dependent on each other, so it got on with the job - although it still didn't use PACKED SSE (=true SIMD), only SCALAR (SISD).

I then tried to multiply the result instead of dividing it (multiply by 1/g, which is 0.5, so it's the same as /2). Multiplications are often way faster than divisions. Pascal went down to ~4s that way. It means their compiler sucks, because that should be automated anyway - I shouldn't have to manually convert divs to multiplications to get the job done faster. It's kind of elemental. Anyway, GCC with C was unaffected: it was already converting the divisions to multiplications at the -O2/-O3 levels. Only at -O0 was it around 5-6 secs.

I then hardwired asm into the Pascal. Initially scalar multiplications and then packed multiplications - all SSE. Scalar took me down to 3.4s, while packed (actual SIMD use) took me to 2.6s. Final code was like:

for i:= 1 to 100000000 do //100mn loop
begin;
asm // THE PACKED WAY / SIMD: doing 20 multiplications in 10 instructions = 2680ms
movlpd xmm1, g     //the multiplier (value of 0.5) is loaded in xmm1's lower half
movhpd xmm1, g     //the multiplier (value of 0.5) is loaded in xmm1's higher half
movlpd xmm2, b     //b is loaded in xmm2's lower half
movhpd xmm2, bb    //bb is loaded in xmm2's higher half
movlpd xmm3, bbb   //bbb is loaded in xmm3's lower half
movhpd xmm3, bbbb  //bbbb is loaded in xmm3's higher half
MULPD xmm2, xmm1   //multiply b and bb, residing in xmm2, by the 0.5 multiplier in xmm1
MULPD xmm3, xmm1   //multiply bbb and bbbb, residing in xmm3, by the 0.5 multiplier in xmm1
MULPD xmm2, xmm1   //round 2
MULPD xmm3, xmm1   //round 2
MULPD xmm2, xmm1   //round 3
MULPD xmm3, xmm1   //round 3
MULPD xmm2, xmm1   //round 4
MULPD xmm3, xmm1   //round 4
MULPD xmm2, xmm1   //round 5
MULPD xmm3, xmm1   //round 5
movlpd b, xmm2     //return b from the lower part of xmm2 back to pascal's b variable
movhpd bb, xmm2    //return bb from the higher part of xmm2 back to pascal's bb variable
movlpd bbb, xmm3   //return bbb from the lower part of xmm3 back to pascal's bbb variable
movhpd bbbb, xmm3  //return bbbb from the higher part of xmm3 back to pascal's bbbb variable
end;
Most of the Pascal delays that take it up to 2.6s are not related to my code. The loop itself, doing zero calculations, costs 1.4s by itself, so there is definitely overhead there.

Anyway, I went back to gcc and C to see what it's doing. At -O3 it was generating MULSD (scalar SSE multiplies, in SISD fashion): the 20 divisions had been converted to 20 separate scalar multiplying SSE instructions. So Single Instruction, Single Data. Again the compiler fails to pack the data and do them in batches. It's using 20 instructions where it could use 10.

Disassembly of section .text:
00000000004005a0 <main>:
  4005a0: 53                      push   %rbx
  4005a1: 48 83 ec 20             sub    $0x20,%rsp
  4005a5: bf 0a 00 00 00          mov    $0xa,%edi
  4005aa: e8 a1 ff ff ff          callq  400550 <putchar@plt>
  4005af: e8 ac ff ff ff          callq  400560 <clock@plt>
  4005b4: 48 89 c3                mov    %rax,%rbx
  4005b7: f2 0f 10 15 59 03 00    movsd  0x359(%rip),%xmm2   # 400918 <_IO_stdin_used+0x48>
  4005be: 00
  4005bf: f2 0f 10 05 59 03 00    movsd  0x359(%rip),%xmm0   # 400920 <_IO_stdin_used+0x50>
  4005c6: 00
  4005c7: f2 0f 10 1d 59 03 00    movsd  0x359(%rip),%xmm3   # 400928 <_IO_stdin_used+0x58>
  4005ce: 00
  4005cf: f2 0f 10 25 59 03 00    movsd  0x359(%rip),%xmm4   # 400930 <_IO_stdin_used+0x60>
  4005d6: 00
  4005d7: 31 c0                   xor    %eax,%eax
  4005d9: f2 0f 10 0d 57 03 00    movsd  0x357(%rip),%xmm1   # 400938 <_IO_stdin_used+0x68>
  4005e0: 00
  4005e1: f2 0f 10 2d 57 03 00    movsd  0x357(%rip),%xmm5   # 400940 <_IO_stdin_used+0x70>
  4005e8: 00
  4005e9: f2 44 0f 10 0d 56 03    movsd  0x356(%rip),%xmm9   # 400948 <_IO_stdin_used+0x78>
  4005f0: 00 00
  4005f2: f2 44 0f 10 05 55 03    movsd  0x355(%rip),%xmm8   # 400950 <_IO_stdin_used+0x80>
  4005f9: 00 00
  4005fb: f2 0f 10 3d 55 03 00    movsd  0x355(%rip),%xmm7   # 400958 <_IO_stdin_used+0x88>
  400602: 00
  400603: f2 0f 59 e1             mulsd  %xmm1,%xmm4
  400607: f2 0f 59 e1             mulsd  %xmm1,%xmm4
  40060b: f2 0f 59 e1             mulsd  %xmm1,%xmm4
  40060f: f2 0f 59 e1             mulsd  %xmm1,%xmm4
  400613: f2 0f 59 e1             mulsd  %xmm1,%xmm4
  400617: f2 0f 59 d9             mulsd  %xmm1,%xmm3
  40061b: f2 0f 59 d9             mulsd  %xmm1,%xmm3
  40061f: f2 0f 59 d9             mulsd  %xmm1,%xmm3
  400623: f2 0f 59 d9             mulsd  %xmm1,%xmm3
  400627: f2 0f 59 d9             mulsd  %xmm1,%xmm3
  40062b: f2 0f 59 c1             mulsd  %xmm1,%xmm0
  40062f: f2 0f 59 c1             mulsd  %xmm1,%xmm0
  400633: f2 0f 59 c1             mulsd  %xmm1,%xmm0
  400637: f2 0f 59 c1             mulsd  %xmm1,%xmm0
  40063b: f2 0f 59 c1             mulsd  %xmm1,%xmm0
  40063f: f2 0f 59 d1             mulsd  %xmm1,%xmm2
  400643: f2 0f 59 d1             mulsd  %xmm1,%xmm2
  400647: f2 0f 59 d1             mulsd  %xmm1,%xmm2
  40064b: f2 0f 59 d1             mulsd  %xmm1,%xmm2
  40064f: f2 0f 59 d1             mulsd  %xmm1,%xmm2
  400653: 66 0f 2e ec             ucomisd %xmm4,%xmm5
  400657: 76 11                   jbe    40066a <main+0xca>
  400659: 66 0f ef f6             pxor   %xmm6,%xmm6
  40065d: f2 0f 2a f0             cvtsi2sd %eax,%xmm6
  400661: f2 0f 58 e6             addsd  %xmm6,%xmm4
  400665: f2 41 0f 58 e1          addsd  %xmm9,%xmm4
  40066a: 66 0f 2e eb             ucomisd %xmm3,%xmm5
  40066e: 76 11                   jbe    400681 <main+0xe1>
  400670: 66 0f ef f6             pxor   %xmm6,%xmm6
  400674: f2 0f 2a f0             cvtsi2sd %eax,%xmm6
  400678: f2 0f 58 de             addsd  %xmm6,%xmm3
  40067c: f2 41 0f 58 d8          addsd  %xmm8,%xmm3
  400681: 66 0f 2e ea             ucomisd %xmm2,%xmm5
  400685: 76 10                   jbe    400697 <main+0xf7>
  400687: 66 0f ef f6             pxor   %xmm6,%xmm6
  40068b: f2 0f 2a f0             cvtsi2sd %eax,%xmm6
  40068f: f2 0f 58 d6             addsd  %xmm6,%xmm2
  400693: f2 0f 58 d7             addsd  %xmm7,%xmm2
  400697: 83 c0 01                add    $0x1,%eax
  40069a: 3d 00 e1 f5 05          cmp    $0x5f5e100,%eax
  40069f: 0f 85 5e ff ff ff       jne    400603 <main+0x63>
  4006a5: f2 0f 11 44 24 18       movsd  %xmm0,0x18(%rsp)
  4006ab: f2 0f 11 54 24 10       movsd  %xmm2,0x10(%rsp)
  4006b1: f2 0f 11 5c 24 08       movsd  %xmm3,0x8(%rsp)
  4006b7: f2 0f 11 24 24          movsd  %xmm4,(%rsp)
  4006bc: e8 9f fe ff ff          callq  400560 <clock@plt>
  4006c1: 48 29 d8                sub    %rbx,%rax
  4006c4: 66 0f ef c9             pxor   %xmm1,%xmm1
  4006c8: f2 48 0f 2a c8          cvtsi2sd %rax,%xmm1
  4006cd: f2 0f 5e 0d 8b 02 00    divsd  0x28b(%rip),%xmm1   # 400960 <_IO_stdin_used+0x90>
  4006d4: 00
  4006d5: f2 0f 59 0d 8b 02 00    mulsd  0x28b(%rip),%xmm1   # 400968 <_IO_stdin_used+0x98>
  4006dc: 00
  4006dd: 66 48 0f 7e cb          movq   %xmm1,%rbx
  4006e2: f2 0f 10 24 24          movsd  (%rsp),%xmm4
  4006e7: f2 0f 10 5c 24 08       movsd  0x8(%rsp),%xmm3
  4006ed: f2 0f 58 e3             addsd  %xmm3,%xmm4
  4006f1: f2 0f 10 44 24 18       movsd  0x18(%rsp),%xmm0
  4006f7: f2 0f 58 c4             addsd  %xmm4,%xmm0
  4006fb: f2 0f 10 54 24 10       movsd  0x10(%rsp),%xmm2
  400701: f2 0f 58 c2             addsd  %xmm2,%xmm0
  400705: bf d4 08 40 00          mov    $0x4008d4,%edi
  40070a: b8 01 00 00 00          mov    $0x1,%eax
  40070f: e8 5c fe ff ff          callq  400570 <printf@plt>
  400714: 66 48 0f 6e c3          movq   %rbx,%xmm0
  400719: bf ea 08 40 00          mov    $0x4008ea,%edi
  40071e: b8 01 00 00 00          mov    $0x1,%eax
  400723: e8 48 fe ff ff          callq  400570 <printf@plt>
  400728: f2 0f 10 05 40 02 00    movsd  0x240(%rip),%xmm0   # 400970 <_IO_stdin_used+0xa0>
  40072f: 00
  400730: 66 48 0f 6e fb          movq   %rbx,%xmm7
  400735: f2 0f 5e c7             divsd  %xmm7,%xmm0
  400739: bf 05 09 40 00          mov    $0x400905,%edi
  40073e: b8 01 00 00 00          mov    $0x1,%eax
  400743: e8 28 fe ff ff          callq  400570 <printf@plt>
  400748: 31 c0                   xor    %eax,%eax
  40074a: 48 83 c4 20             add    $0x20,%rsp
  40074e: 5b                      pop    %rbx
  40074f: c3                      retq
At the -Ofast level is the first time packed instructions start making their appearance (time 1.1s), but they are coupled with a few extra unsafe-math flags for semi-intentional loss of accuracy, and that's problematic. The disassembly at that level shows some scalar muls and a lot of packed additions and packed moves. For some reason it's breaking the divisions down not into 10 packed multiplications, but into ~5 scalar ones plus a lot of extra (packed) additions.

Bottom line: everyone seems to have a lot to do to get the best out of our hardware. The freepascal compiler is lacking elementary logic in processing divs as muls and has several slow parts. As for C... the SSE stuff has been there for like 15 years. LOL. When are they gonna (properly*) use it? And how about AVX, AVX2, etc.? Should we wait till 2100? I bet they'll claim "we are taking advantage of AVX" while doing scalar stuff (SISD) there too - wasting the 256-bit / 512-bit width.

* One could argue that they are using SSE right now, but it's not that useful without exploiting the SIMD capability.

I did... The impression I get with all new language projects is that those with high targets often aim to be the next C/C++. The speaker's references to C++ and how the language is very similar (but safer) in many ways confirm that this is what they have in the back of their mind: "look, we are like C++ but much safer"... But as he points out at some point, if C++ evolves, they might go bust. I mean, what's your selling point? That your compiler notifies you? Is there anything preventing a C++ compiler from notifying the user that what he is doing is unsafe? If they wanted, they could issue a warning or even block compilation altogether on suspected unsafeness. It's doable. It's not a language issue, it's a compiler issue. There could be a compiler flag in C or C++, like --only-allow-safe-code, and suddenly you'd get 100 warnings on how to change your code or it won't compile.
|
|
|
Aztecminer, in the real world what you say makes sense, under the assumption that you expect to make money from the clients using your infrastructure. Someone is paying you $100 per month, you get $20 profit, you are OK with the upgrade costs - otherwise you go out of business. Right? You are not selling your services for near-zero cost. If you did, and you had almost infinite demand as a result, needing upgrades to cope with near-infinite demand and near-infinite abuse (from the near-zero-cost situation), you would not upgrade anything. You'd just say "this is ridiculous, I'm going bust". Your priority would definitely not be to service near-zero-cost users and abusers, but to make the business viable (=fee market). If people said that your network or data center services "don't scale" and that the small guy who wanted your hosting services for 2 cents is "excluded", you'd tell them "cry me a river and fuck off".

Why do you want to have something different where bitcoin is concerned? Why should the priority of bitcoin be to service near-zero-cost users and abusers - and do so by upgrading constantly (=giving them more space to abuse, increasing the costs for everyone) with zero tangible benefits? Why do you want to turn the network into an economic amplification attack against those who service it? Why do you want to do to it what you wouldn't do to your own company?

There is a significant distinction between "this is an infrastructure that is used and paid for, and we must upgrade it" and "this is an infrastructure which is already abused due to the extremely low cost of use - so it doesn't make any sense to give, say, x100 space to the abusers". Still, the abusers will have their near-zero-cost party, as the upgrades are coming soon and will "relieve" them of the enormous costs of the sub-$0.02 fees they are now using to spam the network.
|
|
|
Quoted for truth
|
|
|
Apparently it didn't compile any better in this case, so that could just be the quality of the compiler, I'm not sure. I also don't remember: does Pascal have ranges for floating point as it does for integers?
To get an idea: just doing a 1-to-100mn loop in Pascal (with nothing to execute in it) takes 1.4s on my Q8200 clocked at 1.75GHz. The C equivalent takes 400ms. If you factor this in, the math code may actually be better in Pascal in terms of speed (loop+math takes ~3.5-3.8 secs in C and ~4.5 in Pascal, but the loop is apparently inefficient - although I can't really improve much by unrolling the code x2 and doing 50mn loops, so it's baffling). The combo of asm + Pascal loop is 2.2s only, which, if the loop takes 1.4 by itself, means the math crunching consumes just 0.8? lol? As for the ranges, yes: http://wiki.freepascal.org/Variables_and_Data_Types

A better compiler certainly could do better with sqrt() in some cases, especially with the flag I mentioned (and even without, given sufficient global analysis, but as I said, how much of that to do is somewhat of a judgement call). But I'm just pointing out that the program you fed it was not as simple as it appeared, in terms of what you were asking for.
I'm pretty sure it would choke even if I asked it to do b=b+1 or b*1, bb=bb+1... bbb=bbb+1... bbbb=bbbb+1... Maybe I'll try it out, because since yesterday I've been trying something else with no success:

I'm not sure what the deal is with Pascal, I never use it.
1) I like the Turbo Pascal-like IDE of Free Pascal in the terminal. It's very productive for me - although I'm not producing much of anything.
2) I like the structure, syntax, simplicity and power.
3) See, for example, how I embedded ASM with my preferred syntax (Intel, instead of the more complex AT&T). See the elegance. See the interactivity with the program variables, without breaking my balls about anything. I just dropped in a few lines as a replacement:

asm
movlpd xmm1, b
movhpd xmm1, bb
SQRTPD xmm1, xmm1
movlpd xmm2, bbb
movhpd xmm2, bbbb
SQRTPD xmm2, xmm2
movlpd b, xmm1
movhpd bb, xmm1
movlpd bbb, xmm2
movhpd bbbb, xmm2
end;
...and IT WORKED. Like a boss. Now, trying to do the same since yesterday with c: //This replaces the c sqrts
asm("movlpd xmm1, b");
asm("movhpd xmm1, bb");
asm("SQRTPD xmm1, xmm1");
asm("movlpd xmm2, bbb");
asm("movhpd xmm2, bbbb");
asm("SQRTPD xmm2, xmm2");
asm("movlpd b, xmm1");
asm("movhpd bb, xmm1");
asm("movlpd bbb, xmm2");
asm("movhpd bbbb, xmm2");
...and the result is:

gcc Math3asm.c -lm -masm=intel
/tmp/ccNTa80M.o: In function `main':
Math3asm.c:(.text+0x4f): undefined reference to `b'
Math3asm.c:(.text+0x5c): undefined reference to `bb'
Math3asm.c:(.text+0x69): undefined reference to `bbb'
Math3asm.c:(.text+0x76): undefined reference to `bbbb'
Math3asm.c:(.text+0x91): undefined reference to `b'
Math3asm.c:(.text+0x9a): undefined reference to `bb'
Math3asm.c:(.text+0xa7): undefined reference to `bbb'
Math3asm.c:(.text+0xb0): undefined reference to `bbbb'
Math3asm.c:(.text+0xbd): undefined reference to `b'
Math3asm.c:(.text+0xc6): undefined reference to `bb'
Math3asm.c:(.text+0xcf): undefined reference to `bbb'
Math3asm.c:(.text+0xd8): undefined reference to `bbbb'
collect2: error: ld returned 1 exit status

Ah fuck me with this bullshit. I google to find out what's going on and I drop into this: https://gcc.gnu.org/ml/gcc-help/2009-07/msg00044.html ...where a guy gets something similar... and here comes da bomb:

> Compilation passes - but the linker shouts: "undefined reference to `n'"
>
> What am I doing wrong? Shouldn't it be straightforward to translate
> these simple commands to Linux?
gcc inline assembler does not work like that. You can't simply refer to local variables in the assembler code.
L O L
|
|
|
EDIT: I added something to display the results at the end so it doesn't drop the entire loop, and while using the right compiler options improves things a bit, it is still generating ucomisd, which clearly indicates some sort of range/error/NaN checking. I didn't go through the code carefully to figure out what it is doing, but suffice it to say that sqrt() and 'asm SQRTPD' are not functionally equivalent.
If you write some code that doesn't pull in floating-point minutiae (especially library functions), you will often see actual vectorization.
Yeah, I can't really tell what it's doing either, but seeing 4x SIMD instructions for 4 variables, well, that's a "winner" right there for "FAIL". If the instructions aren't fewer than the data variables, you are doing it wrong. And that's not related to the various checks btw.

It's just a straightforward translation of your source code with four separate sqrt() calls. It is using the SIMD instructions (in a SISD mode) because they are faster than the FPU instructions, as you pointed out.

I'm just a "noob", but is it too much to have the audacious expectation that gcc will actually group things that can be grouped, in order to process them faster? I mean, I couldn't make it any easier for the compiler: I ordered them one after the other, with no other logic steps interfering that might make it question whether grouping is safe (in case other stuff depended on a "sequential" result). Sequential but separate = safe.
|
|
|
1. The result is ok, so no problem there with code behavior.
2. The sqrt is already producing SSE code - I could see it in the disassembler. The problem is that it is not in line with the SIMD spirit: the whole concept is of a *single* instruction processing *multiple* data.
If I have 4x sqrt code and 4x instructions reaching the CPU, then where is the SIMD? It's 4 instructions processing 4 pieces of data. That's, well, Single Instruction Single Data... and on a 128-bit register (using just 64 bits of it).
When you see the disassembler giving you 4 SIMD instructions where there should be 2 (because the variables are 64-bit), you know it's all fucked up right there. I could use the 387 unit as well. Actually I did, out of curiosity. It was slower than the SSE. Apparently the SSE unit is better at this.
// The x87 way / 5150ms
//
// fld b
// fsqrt
// fstp b
// fld bb
// fsqrt
// fstp bb
// fld bbb
// fsqrt
// fstp bbb
// fld bbbb
// fsqrt
// fstp bbbb
...so back to SSE for doing it right (2 instructions, processing 2 pieces of data each). If I were using single precision I could do it with 1 instruction processing 4 pieces of data at once (a 128-bit register fits 4x32-bit).
3. My C equivalent code *was* using the C math library - which should be fast, right? Still, very slow at ~3.8s with a normal -O2 build, and at best 3.5s after thorough tampering.
|
|
|
I have a logic test at the end (not displayed here) to always check the numbers for correctness. It goes like:

Writeln();
Write('Final number: ',b+bb+bbb+bbbb:0:22,' ');
if (b+bb+bbb+bbbb) > 4.0000032938759028 then Write('Result [INCORRECT - 4.0000032938759027 expected]');
if (b+bb+bbb+bbbb) < 4.0000032938759026 then Write('Result [INCORRECT - 4.0000032938759027 expected]');

...anyway, the source for C is:

#include <math.h>
#include <stdio.h>
#include <time.h>

int main() {
    printf("\n");

    const double a = 911798473;
    const double aa = 143314345;
    const double aaa = 531432117;
    const double aaaa = 343211418;
    unsigned int i;
    double score;

    double b;
    double bb;
    double bbb;
    double bbbb;

    b = a;
    bb = aa;
    bbb = aaa;
    bbbb = aaaa;

    double total_time;
    clock_t start, end;
    start = clock();
    for (i = 0; i < 100000000; i++) {
        b = sqrt(b);
        bb = sqrt(bb);
        bbb = sqrt(bbb);
        bbbb = sqrt(bbbb);
        if (b <= 1.0000001) { b = b + i + 12.432432432; }
        if (bb <= 1.0000001) { bb = bb + i + 15.4324442; }
        if (bbb <= 1.0000001) { bbb = bbb + i + 19.42884; }
        if (bbbb <= 1.0000001) { bbbb = bbbb + i + 34.481; }
    }
    end = clock();

    total_time = ((double)(end - start)) / CLOCKS_PER_SEC * 1000;
    score = (10000000 / total_time);
    printf("\nTime elapsed: %0.0f msecs", total_time);
    printf("\nScore: %0.0f\n", score);
    return 0;
}
And Pascal/asm (Free Pascal / 64-bit / Linux) - including the logic test:

{$ASMMODE intel}
Uses sysutils;

Const //some randomly chosen constants to begin math functions
  a: double = 911798473;
  aa: double = 143314345;
  aaa: double = 531432117;
  aaaa: double = 343211418;

Var
  b,bb,bbb,bbbb: double; //variables that will be used for storing square roots
  time1,score: single;   //how much time the program took, and what the benchmark score is
  i: longword;           //loop counter

Begin
  Writeln(); //just printing an empty line

  b:=a;      //begin by assigning some large values in order to start finding square roots
  bb:=aa;
  bbb:=aaa;
  bbbb:=aaaa;

  sleep(100); // a 100ms delay before we start the timer, so that any I/O has stopped

  time1:= GetTickCount64();

  for i:= 1 to 100000000 do //100mn loop
  begin
    asm
      movlpd xmm1, b     //loading the first variable "b" to the lower part of xmm1
      movhpd xmm1, bb    //loading the second variable "bb" to the higher part of xmm1
      SQRTPD xmm1, xmm1  //batch processing both variables for their square root, in the same register, with one SIMD command
      movlpd xmm2, bbb   //loading the third variable "bbb" to the lower part of xmm2
      movhpd xmm2, bbbb  //loading the fourth variable "bbbb" to the higher part of xmm2
      SQRTPD xmm2, xmm2  //batch processing their square roots
      movlpd b, xmm1     //
      movhpd bb, xmm1    // Returning all results from the register back to memory (the Pascal program variables)
      movlpd bbb, xmm2   //
      movhpd bbbb, xmm2  //
    end;

    { b:=sqrt(b);        // This entire part was replaced with the asm above.
      bb:=sqrt(bb);      // On my machine this code gives me ~4530ms while the asm above gives 2240ms.
      bbb:=sqrt(bbb);
      bbbb:=sqrt(bbbb); }

    if b <= 1.0000001 then b:=b+i+12.432432432;     // increase b/bb/bbb/bbbb back to higher values by
    if bb <= 1.0000001 then bb:=bb+i+15.4324442;    // adding integers and decimals to them, in order
    if bbb <= 1.0000001 then bbb:=bbb+i+19.42884;   // to keep the variables large and continue the
    if bbbb <= 1.0000001 then bbbb:=bbbb+i+34.481;  // process of finding square roots, instead of the
                                                    // variables going to "1" due to finite decimal precision.
  end;

  time1:= GetTickCount64() - time1;
  score:= 10000000 / time1; // Just a way to give a "score" instead of just time elapsed.
                            // Baseline calibration is 1000 points rewarded for a 10000ms run...
                            // In other words, if you finish 5 times faster, say 2000ms, you get 5000 points.

  Writeln();
  Write('Final number: ',b+bb+bbb+bbbb:0:22,' ');
  if (b+bb+bbb+bbbb) > 4.0000032938759028 then Write('Result [INCORRECT - 4.0000032938759027 expected]'); //checking result
  if (b+bb+bbb+bbbb) < 4.0000032938759026 then Write('Result [INCORRECT - 4.0000032938759027 expected]'); //checking result

  Writeln();
  Writeln('Time elapsed: ',time1:0:0,' msecs.');  // Time elapsed announced to the user
  Writeln('Score: ', FloatToStr(round(score)));   // Score announced to the user
End.
|
|
|
What? 200-400 ounces per ton from bottles? And gold being ...synthesized because there are near zero trace amounts in the raw material? lol?
|
|
|
After our recent discussion, I made a small program that calculates square roots, for like 100mn loops (x4 = finding 400mn square roots). When a variable tends back to 1, it starts adding to the variables so that it can keep going with the square roots. I started this to see what the performance difference is between Pascal and C (which I avoid like the plague, but anyway) in terms of binaries (=compiler performance), but then I expanded the experiment to see what is wrong with their SSE use. The code for Pascal, C and ASM (inside the Pascal window) is here => http://s23.postimg.org/j74spnqc9/wastingtimewithbenchmarks.jpg

So, Pascal, after fiddling around with all available optimizations, gave me ~4.5s. Interestingly, the debugger (objdump) shows that it uses SSE instructions like SQRTPD, but it's doing so in a weird way. C, with GCC 5.3.x, gave me 3.5-3.9s. Paradoxically, it liked lower -O settings, like -O0... -O1 lost it speed (3.8s) and -O2/-O3 tried to regain it. I also got more performance with -mtune=nocona than -mtune=core2, which is closer (architecturally) to my Q8200 and what it picks automatically when -march=native is used. I also tried -msse2, -msse3, -mssse3, -msse4.1 etc., -mfpmath with all combos, etc., etc.; at best it got down to 3.55s.
The object dumps of the gcc binary didn't enlighten me very much, but I could see that it's using the sqrtsd instruction 4 times. The source is:

for (i = 0; i < 100000000; i++) {
    b=sqrt(b);
    bb=sqrt(bb);
    bbb=sqrt(bbb);
    bbbb=sqrt(bbbb);

and the dump is:

40072e: 0f 84 9b 00 00 00   je 4007cf <main+0x12f>
400734: f2 0f 51 d6         sqrtsd %xmm6,%xmm2
400738: 66 0f 2e d2         ucomisd %xmm2,%xmm2
40073c: 0f 8a 63 02 00 00   jp 4009a5 <main+0x305>
400742: 66 0f 28 f2         movapd %xmm2,%xmm6
400746: f2 0f 51 cd         sqrtsd %xmm5,%xmm1
40074a: 66 0f 2e c9         ucomisd %xmm1,%xmm1
40074e: 0f 8a d9 01 00 00   jp 40092d <main+0x28d>
400754: 66 0f 28 e9         movapd %xmm1,%xmm5
400758: f2 0f 51 c7         sqrtsd %xmm7,%xmm0
40075c: 66 0f 2e c0         ucomisd %xmm0,%xmm0
400760: 0f 8a 47 01 00 00   jp 4008ad <main+0x20d>
400766: 66 0f 28 f8         movapd %xmm0,%xmm7
40076a: f2 0f 51 c3         sqrtsd %xmm3,%xmm0
40076e: 66 0f 2e c0         ucomisd %xmm0,%xmm0
400772: 0f 8a b5 00 00 00   jp 40082d <main+0x18d>

...when proper SSE use means it would load two values into the same register and do batch processing (= 2 instructions x 2 data items on the same registers).
So, I went back to Pascal, which I like better for the Turbo-Pascal-like IDE in the console, and changed the code there from:

for i:= 1 to 100000000 do
begin
  b:=sqrt(b);
  bb:=sqrt(bb);
  bbb:=sqrt(bbb);
  bbbb:=sqrt(bbbb);

...to:

for i:= 1 to 100000000 do //100mn loop
begin
  asm
    movlpd xmm1, b     //loading the first variable "b" to the lower part of xmm1
    movhpd xmm1, bb    //loading the second variable "bb" to the higher part of xmm1
    SQRTPD xmm1, xmm1  //batch processing both variables for their square root, with one SIMD command
    movlpd xmm2, bbb   //loading the third variable "bbb" to the lower part of xmm2
    movhpd xmm2, bbbb  //loading the fourth variable "bbbb" to the higher part of xmm2
    SQRTPD xmm2, xmm2  //batch processing their square roots
    movlpd b, xmm1     //
    movhpd bb, xmm1    // Returning all results from the register back to the Pascal variables
    movlpd bbb, xmm2   //
    movhpd bbbb, xmm2  //
  end;

...and voila, my times went down to 2.2s.

So: Pascal ~4.5s, C ~3.6s, Pascal with simple, rational SSE use by someone who is not even a coder and goes to RTFM of what the SSE instructions do in order to use them = 2.2s. Ladies and gentlemen, it is official: our language compilers SUCK BALLS. I had the 4 variable-assignment/sqrt lines lined up one after another so that it was made extremely easy for the compiler to do batch processing with SSE. I even issued a #pragma directive to gcc to force it, and it didn't do anything. No, the compilers "know better". ...That's how "C is a fast language" goes down the drain. With a simple -O2 compilation it would be at 3.8s (by "using" SSE - or, more precisely, misusing it) vs my 2.2s of manual tampering in Pascal. So C became ~70% slower even when faced with almost ideally placed source code that it could exploit. (Side by side Pascal / C / ASM inside Pascal): http://s23.postimg.org/j74spnqc9/wastingtimewithbenchmarks.jpg
|
|
|
My bad. I left the browser tab open from yesterday, but as soon as I refreshed I see your number.
Well, it's up to 7400 now just in the last few minutes, so it does seem to be pretty logjammed. Broadcasting txs is free... so it could be 100 million for the lolz (without a strict mempool size). What matters is whether these txs will ever be included - and if they are paying peanuts, or nothing, they shouldn't.
|
|
|
Gold can be made from brown beer bottle glass in a microwave. People have been doing it for years now. The electrons change the glass into gold and other precious metals. Some companies and governments probably make tons of gold this way secretly. Then they tell you it's rare, which is a lie. Precious metals are a scam.
Bitcoin will also fail in a few years IMO.
Retard. Glass is made from silica, gold is an element. Every day the movie Idiocracy becomes closer to reality. Bet you also think we live on a flat earth too.

What he's saying is partially true. Microwaving helps the separation of fine silica and gold which is trapped in the (raw material) sand that was used. Gold is *everywhere* around us. Every soil or sand, even ocean water, has some tiny amount of gold. The problem is that it is in the "parts per billion" range. Some sands have higher content and, if there were a method of separation, it would be possible to extract the gold from the glass. And it just so happens that microwaving is able to do it, under some circumstances. I am not aware of the cost-vs-benefit ratios though. The costs for me, for example, would be something like:

- lost revenue per bottle that could be recycled (I think recyclers over here pay ~0.10 euro or something per beer bottle - which would be the equivalent of extracting 0.003 grams of gold per bottle)
- money for industrial microwave ovens
- money for electricity
- money for handling tools and graphite casting equipment or similar, which tend to break due to the glassification of the (remelted) sand
- disposal costs of amorphous molten glass (?)

...etc... So is it worth it? Who knows. But the price they are paying per bottle is definitely "fishy". I've often wondered why they are paying so much for recycling glass beer bottles.

edit: And I just noticed you have an avatar of an astronaut drinking from a beer bottle.
|
|
|
It is interesting that in this thread we can see that people have 2 different interpretations of the word dream and are communicating ...in parallel. Some refer to fantasizing / day-dreaming, others to sleep dreaming...
|
|
|
|