NSGminer v0.9.4: The Fastest NeoScrypt GPU Miner

ghostlander (OP)

Legendary

Offline

Activity: 1239
Merit: 1020

No surrender, no retreat, no regret.

Re: NSGminer v0.9.4: The Fastest NeoScrypt GPU Miner

February 25, 2019, 03:32:29 PM

#641

Quote from: sorji077 on February 22, 2019, 01:02:59 PM

@ghostlander can you share the instructions file for the Windows build:

Native WIN32 build instructions: see windows-build.txt

They were obsolete, so I removed them. There is nothing special for MinGW. Have never tried MSVC with it.

"If you've got a problem and have to spread some coins to make it go away, you've got no problem. You've got an expense." ~ Phoenixcoin (PXC) and Orbitcoin (ORB) and Halcyon (HAL)

sorji077

Jr. Member

Offline

Activity: 42
Merit: 1

Re: NSGminer v0.9.4: The Fastest NeoScrypt GPU Miner

March 06, 2019, 05:04:22 AM

#642

Quote from: ghostlander on February 25, 2019, 03:32:29 PM

Quote from: sorji077 on February 22, 2019, 01:02:59 PM

@ghostlander can you share the instructions file for the Windows build:

Native WIN32 build instructions: see windows-build.txt

They were obsolete, so I removed them. There is nothing special for MinGW. Have never tried MSVC with it.

The reason I'm asking is that the Xaya team has reserved a bounty for your open-source miner, and the algo they use is a modified NeoScrypt.
I notice you are running your own coin, so I'm not sure whether it's OK to ask you for help, but you can claim the bounty if you can get it to work.

https://forum.xaya.io/topic/27-bounties-pools-xaya-core-exploits/

XAYA Core Bounties :
Mining Software:
build nsgminer for windows > 2000 CHI - https://github.com/xaya/nsgminer

ag1233

Newbie

Offline

Activity: 7
Merit: 0

Re: NSGminer v0.9.4: The Fastest NeoScrypt GPU Miner

September 12, 2023, 05:58:47 PM

#643

hi ghostlander,
are you still monitoring this thread?

hi all,

oops, I've posted my comment in the 'wrong' thread
https://bitcointalk.org/index.php?topic=55038.msg62832733#msg62832733
reposting here

Recently I tried getting cpuminer to run on a Raspberry Pi 4 (aarch64).
I'm using somewhat older code (version 2.4) from
https://github.com/ghostlander/cpuminer-neoscrypt

I couldn't figure out how to get it to build with the ARM assembly code, and scrypt-arm.S appears to be written for armhf (32-bit ARM processors).
Hence, there could be issues compiling it for aarch64 (64-bit ARM instructions and OS).

Among the things I tried, I added the following flags
Code:


-O2 -ftree-vectorize -ftree-slp-vectorize -ftree-loop-vectorize -ffast-math -ftree-vectorizer-verbose=7 -funsafe-loop-optimizations -funsafe-math-optimizations

I checked by compiling with the "-S" option, which makes GCC emit assembly, and the suite of flags above does cause GCC to generate assembly with NEON SIMD instructions.
This is without any hand-optimized assembly. GCC may still emit NEON with a few of the flags dropped (e.g. possibly without -ftree-slp-vectorize, though I think that one is useful even without NEON), but when some of the above flags are missing, NEON assembly isn't generated.
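To make the "-S" check above concrete, here is a minimal sketch of the kind of loop the vectorizer can usually handle. The function name xor_words and the file name in the comment are made up for illustration and are not taken from cpuminer; block XOR is just a convenient example of a simple element-wise loop.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical example, not code from cpuminer.
 * XOR n 32-bit words of src into dst. The loop body has no dependency
 * between iterations, and "restrict" promises the buffers do not overlap,
 * so GCC's tree vectorizer is free to process several words per step.
 *
 * To inspect the result on aarch64 (file name is an assumption):
 *   gcc -O2 -ftree-vectorize -S xor_words.c
 * and look in xor_words.s for instructions that use the v/q vector registers.
 */
void xor_words(uint32_t *restrict dst, const uint32_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}
```

On recent GCC releases, adding -fopt-info-vec to the command line should also print which loops were vectorized, which is quicker than reading the assembly by eye.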

I tried building with and without the above flags (e.g. with just -O2), and there is at least a slight difference in hash rates: from about 4 khash/s on all 4 cores doing NeoScrypt (mining Feathercoin) to about 5+ khash/s with the NEON code generated by GCC's -ftree-vectorize vectorizer, an improvement of some 20-30%. The CPU also runs hotter during mining at the higher hash rate, which suggests the cores are being kept busier. This is probably a useful thing to have around, as manually writing hand-optimized assembly (e.g. for scrypt-arm.S) would likely take a lot of effort and is likely less portable. Granted, -ftree-vectorize won't produce the fastest code, but the improvement is decent for much less manual effort.

Note that NEON code may not work on ARM CPUs that lack NEON support: I think I chanced upon some specs saying that the SIMD extension is possibly *optional* on A53 CPUs.
So it is quite possible that some A53s in the wild (e.g. the 'cheap' ones) do not have NEON, even though they are A53 CPUs.
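If a binary is going to run on boards of unknown origin, one way to guard against this is to ask the kernel at startup. The following is only a sketch, assuming Linux on ARM; cpu_has_neon is a made-up helper name, and on any other platform it simply reports false.

```c
#include <stdbool.h>

#if defined(__linux__) && (defined(__aarch64__) || defined(__arm__))
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_ASIMD (arm64) / HWCAP_NEON (32-bit ARM) */
#endif

/*
 * Hypothetical helper: returns true if the kernel reports NEON / Advanced
 * SIMD support for this CPU. Only implemented for Linux on ARM; every
 * other platform falls through to "false".
 */
bool cpu_has_neon(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) != 0;
#elif defined(__linux__) && defined(__arm__)
    return (getauxval(AT_HWCAP) & HWCAP_NEON) != 0;
#else
    return false;
#endif
}
```

A miner could call this once before selecting a NEON code path and fall back to the plain C routines otherwise. On Linux the same information can be read by hand from the Features line of /proc/cpuinfo.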

It used to be that Raspberry Pis were deemed 'too slow' for mining, but the Raspberry Pi 4 with its A72 ARM cores is just borderline and 'punches above its weight' mining alongside the big Mhash/s GPUs, though the difference is easily 1:1000.

--
Just by using the flags mentioned, GCC builds binaries with NEON SIMD via -ftree-vectorize; the other flags are needed too, as otherwise it doesn't turn on SIMD code generation.
This is a 'quick and easy' way to at least get some NEON SIMD on aarch64, and the result isn't too bad, as I've described.

off-topic:
Just to add a note, I tried to 'hand optimize' it by writing a C source in which I re-arranged the Salsa20 arrays to fall into 'lanes', using the same -ftree-vectorize flags. However, instead of being faster, it was slower: the original code is optimized better even though it actually uses fewer NEON SIMD instructions. I looked closer at the generated NEON code, and I think the problem is that simply 're-arranging' the arrays won't cut it, because the array is permuted between the iterations/loops. The values get streamed out to memory (a bummer, with lots of stalls I'd think) and then loaded back from memory into a differently permuted set of registers.
With the original code there is actually less SIMD. It seems -ftree-vectorize and the other optimizations simply use the normal registers for part of the code and pass values into SIMD registers for some sections, and that in itself is faster than the 'rearranged array' code.
--
There is a minor gain with -ftree-vectorize for NeoScrypt, as it spends a large number of loops in Salsa20 (1000 x some 200 rounds?).
Hence, NEON SIMD could potentially speed that up significantly. The trouble is that Salsa20
https://en.wikipedia.org/wiki/Salsa20
permutes the arrays between the quarter rounds in each loop. I made a naive attempt by simply re-arranging the arrays in the C code so that they look like they fall into 'lanes' (as is common with SIMD).
That oversimplified approach doesn't cut it with -ftree-vectorize: the registers get streamed out to memory (lots of wait states and CPU stalls on the small Raspberry Pi type boards and CPUs).
Hand-optimized assembly, on the other hand, won't be easy to write and would take quite a lot of effort,
and this won't be the only thing that needs to be optimized.
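For reference, the permutation problem is visible directly in the Salsa20 round structure. Below is a plain C sketch of the quarter round and the two half rounds as laid out in the Salsa20 specification; it is written from the spec for illustration and is not the code from cpuminer.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (n is 7, 9, 13 or 18 here). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Salsa20 quarter round: each step depends on the result of the previous one. */
void salsa20_quarterround(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    *b ^= rotl32(*a + *d, 7);
    *c ^= rotl32(*b + *a, 9);
    *d ^= rotl32(*c + *b, 13);
    *a ^= rotl32(*d + *c, 18);
}

/* Column round: the four quarter rounds walk down the columns of the 4x4 state. */
void salsa20_columnround(uint32_t x[16])
{
    salsa20_quarterround(&x[0],  &x[4],  &x[8],  &x[12]);
    salsa20_quarterround(&x[5],  &x[9],  &x[13], &x[1]);
    salsa20_quarterround(&x[10], &x[14], &x[2],  &x[6]);
    salsa20_quarterround(&x[15], &x[3],  &x[7],  &x[11]);
}

/* Row round: the same quarter round, but now along the rows of the state. */
void salsa20_rowround(uint32_t x[16])
{
    salsa20_quarterround(&x[0],  &x[1],  &x[2],  &x[3]);
    salsa20_quarterround(&x[5],  &x[6],  &x[7],  &x[4]);
    salsa20_quarterround(&x[10], &x[11], &x[8],  &x[9]);
    salsa20_quarterround(&x[15], &x[12], &x[13], &x[14]);
}
```

The index patterns differ between the column round and the row round, so words that would share a SIMD lane in one half of a double round do not share it in the next. Hand-written SIMD versions usually deal with this by shuffling lanes inside the vector registers, which is presumably what a plain re-arrangement of the C arrays does not get the compiler to do.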

Hence, for now the 'easy' way is simply to use -ftree-vectorize bundled with the other flags, so that at least some form of NEON SIMD is achieved.
There is a decent gain of around 20% (for NeoScrypt) with vs. without the compiler-generated SIMD code.

ghostlander (OP)

Legendary

Offline

Activity: 1239
Merit: 1020

No surrender, no retreat, no regret.

Re: NSGminer v0.9.4: The Fastest NeoScrypt GPU Miner

October 06, 2023, 10:58:41 PM
Last edit: October 06, 2023, 11:15:30 PM by ghostlander

#644

ag1233, I have written only i386 and amd64 asm code, with and without SSE2 support. What's in scrypt-arm.S isn't of much importance to NeoScrypt in general, as Salsa20 constitutes a rather small part of it. Sure, NEON can speed things up even if compiler generated. Memory bandwidth is another question. When I last checked, the 32-bit LPDDR4 powered RPi 4B couldn't reach 5GB/s on memory reads or writes. Although a quad core 1.5GHz Cortex-A72 with 1MB L2 cache doesn't seem a poor performer, I don't think it's much faster than my old Jetson TX1. Modern high end smartphones are much better in this regard.
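For anyone who wants a rough number for their own board, a crude copy test along these lines gives a ballpark. This is only a sketch: copy_bandwidth_gbs is a made-up name, it uses a single thread and wall-clock time, and it is not the method behind the 5GB/s figure above.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: copy a buffer of "bytes" bytes "reps" times and
 * return the rate in GB/s of data copied (each byte counted once, although
 * a copy is one read plus one write). Returns -1.0 on bad arguments,
 * allocation failure, or a zero time interval. Use a buffer much larger
 * than the L2 cache, otherwise this measures the cache and not DRAM.
 */
double copy_bandwidth_gbs(size_t bytes, int reps)
{
    if (bytes == 0 || reps <= 0)
        return -1.0;

    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0x5a, bytes);   /* touch every page before timing */
    memset(dst, 0, bytes);

    volatile unsigned char sink = 0;  /* keeps the copies from being optimized away */
    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);
    for (int i = 0; i < reps; i++) {
        src[0] = (unsigned char)i;
        memcpy(dst, src, bytes);
        sink ^= (unsigned char)(dst[0] ^ dst[bytes - 1]);
    }
    timespec_get(&t1, TIME_UTC);

    free(src);
    free(dst);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        return -1.0;
    return (double)bytes * (double)reps / secs / 1e9;
}
```

Calling it with something like a 256 MiB buffer and 10 repetitions should be far enough past the cache sizes on these boards to be dominated by DRAM traffic.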


JayDDee

Full Member

Offline

Activity: 1388
Merit: 220

Re: NSGminer v0.9.4: The Fastest NeoScrypt GPU Miner

October 07, 2023, 12:09:33 AM

#645

Quote from: ag1233 on September 12, 2023, 05:58:47 PM

Among the things I tried, I added the following flags
Code:


-O2 -ftree-vectorize -ftree-slp-vectorize -ftree-loop-vectorize -ffast-math -ftree-vectorizer-verbose=7 -funsafe-loop-optimizations -funsafe-math-optimizations

This doesn't apply to asm; it only affects vectorization of C code. There are flags that can be set to enable and disable ASM for various architectures.
32-bit ARM asm is likely disabled on AArch64.
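As an illustration of how this kind of per-architecture selection is commonly expressed in C sources, here is a sketch using the compiler's predefined macros. The function simd_build_target is made up for this example, and the strings are only labels; the actual macros and configure flags used by cpuminer's build are not shown here.

```c
/*
 * Hypothetical example: report which code path a build like this would
 * select, based only on macros the compiler predefines for the target.
 * A real miner would guard its asm or intrinsics with similar #if blocks.
 */
const char *simd_build_target(void)
{
#if defined(__aarch64__)
    return "aarch64: 32-bit ARM asm not applicable, C code (NEON via compiler)";
#elif defined(__arm__) && defined(__ARM_NEON)
    return "arm32 with NEON";
#elif defined(__arm__)
    return "arm32 without NEON";
#elif defined(__x86_64__) && defined(__SSE2__)
    return "amd64 with SSE2";
#elif defined(__i386__) && defined(__SSE2__)
    return "i386 with SSE2";
#elif defined(__x86_64__) || defined(__i386__)
    return "x86 without SSE2";
#else
    return "generic C";
#endif
}
```

Printing this string at startup is a cheap way to confirm which path a given binary was built for before comparing hash rates.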

Stay tuned...
https://github.com/JayDDee/cpuminer-opt/wiki/Support-for-AARCH64

cpuminer-opt developer. https://github.com/JayDDee/cpuminer-opt
https://bitcointalk.org/index.php?topic=5226770.msg53865575#msg53865575
