Does this extend to Intel IGPU mining? I've seen some discussion of the topic, and that cgminer supported it at one time, but I haven't found anything concrete to build on. If there is somewhere I can start I would like to pursue this.
We haven't worked on Intel IGPU mining support yet. However, since Intel IGPUs generally support OpenCL, it should be possible to add support for these GPUs to sgminer. Our latest branch of sgminer is available here: https://github.com/nicehash/sgminer. Currently we don't have any plans to add Intel IGPU mining support to sgminer, however this does sound like an interesting project, and if someone is willing to do the development effort for this integration, we would definitely donate to this effort once integration is complete. Best regards, NiceHash team.

Thanks for the quick reply. I tried another version of sgminer and it didn't recognize the IGPU. Since it has apparently worked in the past on cgminer, I was hoping the development had already been done and it would simply be an integration effort. I'll keep looking.
|
|
|
I was hoping to get a better response to my technical trolls but all I got was more bluster. I was trying to find out if our skills were complementary. I am a complete noob when it comes to CUDA, so I was hoping SP could implement some of my ideas with his knowledge of CUDA. When I provided a demonstration of my skills he responded with "silly you, that was CPU verification code, and why don't you do better", without ever considering the technical merit or other applications for the changes I made. He's more interested in selling what he has over and over again rather than providing anything new that sells itself. I'm afraid SP has turned into a telemarketer.
Assembler for NVIDIA Maxwell architecture: https://github.com/NervanaSystems/maxas

Thanks, that will be useful when I learn how to use it. I'm looking for docs that describe the CUDA processor architecture in detail so I can determine things like how many loads to queue up to fill the pipe, how many execution units there are, user cache management, etc. That kind of information is necessary to maximize instruction throughput at the processor level. Do you know of any available docs with this kind of info?

There is not much info available, but if you disassemble compiled code you will see that the Maxwell is superscalar with 2 pipes: 2 instructions per cycle. It's able to execute instructions while writing to memory if the code is in the instruction cache. And to avoid ALU stalls you need to reorder your instructions carefully. There are vector instructions that can write bigger chunks of memory with fewer instructions... etc etc. The compiler is usually doing a good job here. Little to gain. Ask DJM34 for more info. He is good at the random stuff...

Thanks again. Have you tried interleaving memory accesses with arithmetic instructions so they can be issued in the same clock? When copying memory, do you issue the first load and the first store immediately after it? The first load fills the cache line and the first store waits for the first bytes to become available. Then you can queue up enough loads to fill the pipe and do other things while waiting for memory. Multi-buffering is a given, being careful not to overuse registers. If you're doing a load, process, and store it's even better, because you can have one instruction slot focused on memory while the other does the processing. These are things I'd like to try but haven't got the time for. Although I've done similar things in the past, there were no performance tests that could quantify the effect, good or bad. If you think this has merit, give it a shot. Like I said, if it works just keep it open because I could still implement it myself.

The hotter the code segments you choose, the bigger the result should be. Some of the assembly routines would be logical targets.
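The load-early idea above can be roughly illustrated in plain C. Everything here (the function name and the rotate-xor transform) is invented for the sketch; on a GPU the win would come from dual-issue slots, which a host compiler may or may not reproduce, so treat this as a shape, not a benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/* Two-way unrolled copy-and-transform loop: the loads for the NEXT pair
 * of elements are issued while the arithmetic for the current pair is
 * still in flight, so memory latency overlaps with computation. */
static void transform_interleaved(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;
    if (n >= 2) {
        uint32_t a = src[0];          /* prime the pipeline: first loads */
        uint32_t b = src[1];          /* issued before any arithmetic    */
        for (i = 0; i + 2 < n; i += 2) {
            uint32_t na = src[i + 2]; /* next loads start early...       */
            uint32_t nb = src[i + 3 < n ? i + 3 : i + 2];
            dst[i]     = (a << 7 | a >> 25) ^ 0x9e3779b9u;  /* ...while  */
            dst[i + 1] = (b << 7 | b >> 25) ^ 0x9e3779b9u;  /* these run */
            a = na;
            b = nb;
        }
        dst[i] = (a << 7 | a >> 25) ^ 0x9e3779b9u;          /* drain     */
        if (i + 1 < n)
            dst[i + 1] = (b << 7 | b >> 25) ^ 0x9e3779b9u;
        i += 2;
    }
    for (; i < n; i++)                /* tail for n < 2 */
        dst[i] = (src[i] << 7 | src[i] >> 25) ^ 0x9e3779b9u;
}
```

Whether this beats the compiler's own scheduling would have to be measured per target, as noted above.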
|
|
|
Hello nicehash, I have noticed you have taken an interest in mining software development including ASIC, AMD, Nvidia, and CPU mining. Does this extend to Intel IGPU mining? I've seen some discussion of the topic, and that cgminer supported it at one time, but I haven't found anything concrete to build on. If there is somewhere I can start I would like to pursue this. I would also like to plug my recent fork of TPruvot's cpuminer-multi. It doesn't compile on Windows yet but I'm working on it. https://bitcointalk.org/index.php?topic=1326803.0
|
|
|
Currently many of us are mining with a 970, with single or multiple cards. This section will be used as an information index about the GTX 970. We will discuss new coins, new algos, new mining software and profitability. As a new miner, it is hard to find the correct coin to mine. Hope this thread will be a nice place for new miners. Please mention the following: 1. What coin are you currently mining? 2. What miner are you using? 3. What hashrate are you getting?
There are experienced miners and devs here. Hope they will come to help us too.
I don't think this needs to be specific to the 970. Why not broaden the scope? First, which Nvidia cards perform better on which algos; for example, the 750 Ti is an outstanding performer when mining lyra2v2. How about AMD cards, CPU mining, and various mining software, free and $$$? To answer your questions, I have a pair of EVGA 970s and I mine mostly with ccminer-1.5.74-SP_mod, but there are other variations with their own benefits. Also come join us in the miner SW threads; there's lots of discussion about squeezing out more hash. I will shamelessly plug my new fork of cpuminer called cpuminer-opt, meaning optimized. It's the fastest CPU miner I am aware of and supports the most algos. https://bitcointalk.org/index.php?topic=1326803.0
|
|
|
joblo, does your quark optimisation work in the end? Not sure I understand your conversation with sp_ fully: where does the +30% come from?
Joblo's optimization impacts CPU validation of any found shares. This is usually insignificant, but since he's also mining with all CPU cores, it did have an impact for him: his CPU mining was slowing down ccminer.

Joblo: You're invited for a beer over at #ccminer @freenode: there's friendlier dev talk there, some collaboration now and then, and certainly a lot less BS.

Thanks for the invite. I plan to join #ccminer (and github, and...) when things settle down, which they are beginning to do. I've been so busy trying to get all the algos supported and delivering the quick optimizations that I'm only now starting to think longer term. I'm working on a design to modularize algos that doesn't require any base code changes when adding a new algo. But that's a big feature that requires a lot of thought. I have high standards and don't want to present a half-baked plan.
|
|
|
joblo, does your quark optimisation work in the end? Not sure I understand your conversation with sp_ fully: where does the +30% come from?
Mostly from more efficient management of the groestl ctx. Quark can run groestl twice per round, so the init function was being run twice every time the hash function was called, and the hash function itself was called in a scanhash loop. That's two inits per hash call, times the whole scanhash loop, for something that only needs to be done once. Eliminating that was a big boost, though I don't recall exactly how much. The reduction in the number of inits also helped other algos like x11. I also created a fast reinit function that skips the constants, so now a full init is done once when scanhash is called and any subsequent reinits that are necessary are fast. That alone added another 5%. I have another idea to factor out the full init from scanhash so the ctx will be fully initted only once, ever, before entering the thread loop.
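A minimal sketch of the scheme described above, using an invented toy context in place of the real sph_groestl512 context: the expensive setup runs once, and every later "reinit" just restores a saved snapshot instead of rebuilding the state from scratch.

```c
#include <assert.h>
#include <stdint.h>

/* Invented stand-in for a hash context; the real sph_* contexts hold
 * the same kinds of fields (chaining state, byte count). */
typedef struct {
    uint64_t state[8];   /* chaining value, must be reset between hashes  */
    uint64_t count;      /* bytes processed, must be reset between hashes */
} toy_ctx;

static toy_ctx saved_init;   /* snapshot of a freshly initialized ctx */
static int saved_valid = 0;

/* Stand-in for the expensive full init (IVs, constant setup, ...). */
static void toy_full_init(toy_ctx *ctx)
{
    for (int i = 0; i < 8; i++)
        ctx->state[i] = 0x6a09e667f3bcc908ULL + (uint64_t)i;
    ctx->count = 0;
}

/* Done once, when scanhash is entered. */
static void toy_init_once(toy_ctx *ctx)
{
    toy_full_init(ctx);
    saved_init = *ctx;
    saved_valid = 1;
}

/* Done per hash call inside the loop: a struct copy, no constant setup. */
static void toy_fast_reinit(toy_ctx *ctx)
{
    assert(saved_valid);
    *ctx = saved_init;
}
```

The final step joblo mentions, hoisting `toy_init_once` out of scanhash entirely so it runs once per thread, is just a matter of where the one-time call is placed.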
|
|
|
Got it working somewhat, it just hasn't shown any hashrate yet. Was the CPU working or just idling? Did you try other pools or algos? I've streamlined the check in v3.0.7. The check for SSE2 wasn't working, and with the plan to drop the separate generic x86_64 target the SSE check isn't needed anymore. Starting in 3.0.7 it will display separately whether the CPU and the build support AES_NI and select the appropriate target. The startup display will also be directly linked to the target selection; previously there were two separate checks.
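A sketch of that kind of startup check (the function names and messages are invented; `__builtin_cpu_supports` is a GCC/Clang builtin on x86, and the `__AES__` macro reflects the compiler's target flags, so it reports what the build supports rather than the CPU):

```c
#include <stdio.h>

/* Run-time CPU capability checks. */
static int cpu_has_sse2(void) { return __builtin_cpu_supports("sse2") != 0; }
static int cpu_has_aes(void)  { return __builtin_cpu_supports("aes")  != 0; }

/* Compile-time build capability: set by -maes / -march=native. */
static int build_has_aes(void)
{
#ifdef __AES__
    return 1;
#else
    return 0;
#endif
}

/* Display CPU and build capabilities separately, then pick the target:
 * the AES_NI code path is used only when both the CPU and the build
 * support it, which ties the startup display directly to the target
 * selection instead of running two separate checks. */
static int select_aes_target(void)
{
    printf("CPU features: SSE2 %s, AES-NI %s\n",
           cpu_has_sse2() ? "yes" : "no", cpu_has_aes() ? "yes" : "no");
    printf("SW  features: AES-NI %s\n", build_has_aes() ? "yes" : "no");
    return cpu_has_aes() && build_has_aes();
}
```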
|
|
|
While you were trolling my thread I added another 0.4% in the decred algo. I will try to do 5% and include it in my donation miner.
Since I forked cpuminer I've increased performance by up to 92% (x13), 75% (x15), 36% (qubit) and 27% (quark). I can't take credit for all of it because much of it was just plugging in faster functions that already existed. But all the gains in quark are mine.
|
|
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is, it's faster.

The cpu verification is only done when the gpu finds a solution. I know why changing the verification code made things faster: I was CPU mining 8 threads at the time, so it was slowing down the CPU. But in ccminer you can just remove the verification. It's there so that you can check if you break the hash when you change something.

Tried that in cpuminer, didn't help. I only managed to get another 1% out of c11; not sure why, expected more, will take another look. No other algos benefit from the fast ctx reinit, but you should try it in ccminer, in the GPU kernel, that is.
|
|
|
You are an assembly language guy; do you reorder instructions to maximize instruction throughput? It requires detailed knowledge of the processor, such as how many instructions can be fetched per clock, how many can be executed per clock, how deep the memory buffer is, whether it delays writes to prioritize reads, how big a cache line is, etc. I know none of this stuff; maybe you do and could use it to speed up the hot spots.
This is something the compiler is very good at. The CUDA core is a 3+ operation RISC processor with up to 256 registers. It is built for the compiler. Sometimes you need to move code around, manually unroll some loops, etc., and verify the result by disassembling (this is what DJM34 is calling random stuff). But don't let the code size grow too big; the instruction cache is small. ... While you were trolling my thread I added another 0.4% in the decred algo. I will try to do 5% and include it in my donation miner.

I was talking more about performing loads as soon as possible to give memory time to respond before you need the data. It also fills the cache line for subsequent loads. If CUDA supports read priority you can even issue a store before a load and the load will have priority; you just have to watch for register conflicts. There is also issuing different types of instructions on the same clock to improve superscalar operation. These kinds of things are hard for a normal compiler to do because they are specific to each processor, but if anyone can do it it's CUDA, because they have one HW architecture, one run time system and one compiler. And another thing, you trolled me first.
|
|
|
Progress update.
I found 4% more hash in quark and I've tested some of the more obscure algos, so another 3.0 update is coming before 3.1. I'll take another day to look for more low hanging fruit and do a full suite of testing before releasing. I want this to be super stable.
Then I will start on windows, I promise.
V 3.0.7 almost ready.
Edit
I was checking some stats while testing and here is how much has been gained since the project forked.
quark +27%, qubit +36%, x13 +92%, x15 +76%
It's come a long way.
|
|
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is, it's faster. The cpu verification is only done when the gpu finds a solution. I may not have realized I was looking at verification code at the time, but I know what it is. Maybe my changes can be applied to the GPU code and you'll get your 30%.
|
|
|
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
My changes have nothing to do with avoiding branches; they're about avoiding work.
|
|
|
When is the last time you delivered 30% in less than an hour?
Today: your quark kernel. Since skein is much faster than groestl we only do skein and throw away 50% of the hashes.

    if (hash[0] & 0x8) {
        sph_groestl512_init(&ctx_groestl);
        sph_groestl512(&ctx_groestl, (const void*) hash, 64);
        sph_groestl512_close(&ctx_groestl, (void*) hash);
    } else {
        sph_skein512_init(&ctx_skein);
        sph_skein512(&ctx_skein, (const void*) hash, 64);
        sph_skein512_close(&ctx_skein, (void*) hash);
    }

There was an optimization made in cpuminer where, if it was determined that a second round of groestl was necessary, the existing hashes would be thrown away, on the belief that it would take longer to complete the second groestl than to start over. It didn't work. However, I might try ccminer's logic: cpuminer uses a state machine as the engine, while ccminer just uses a simple if. I'm also going to look at other contexts. Selectively reinitializing the necessary fields may be quicker than the current implementation of copying a saved initialized context. Both are quicker than what ccminer does.
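The dispatch SP quotes keys on one bit of the intermediate hash, so over random hashes each path runs about half the time, which is why skipping the slow path discards roughly 50% of the hashes. A toy sketch with invented stand-in round functions (the real sph_* routines are in the codebase and not reproduced here):

```c
#include <stdint.h>

/* Invented stand-ins for the real sph_groestl512 / sph_skein512 rounds;
 * only the dispatch shape matters here, not the hashing. */
static void toy_groestl(uint32_t *hash) { hash[0] = hash[0] * 2654435761u + 1; }
static void toy_skein(uint32_t *hash)   { hash[0] = hash[0] * 2246822519u + 3; }

/* Branch on one bit of the intermediate hash, as in the quark kernel.
 * Returns 1 if the (slower) groestl path was taken, 0 for skein. */
static int quark_branch(uint32_t *hash)
{
    if (hash[0] & 0x8) {
        toy_groestl(hash);
        return 1;
    } else {
        toy_skein(hash);
        return 0;
    }
}
```

Running this over many pseudo-random intermediate hashes shows the two paths balancing near 50/50, which is the basis of the "do only skein, throw half away" trade-off.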
|
|
|
|