I have been running keyhunt BSGS for 6+ months, searching for puzzle 135 in random mode on 4 PCs, with no luck. I wish someone like RC, or anyone searching 135 in sequential mode, would share their status so that we can rule out some key ranges.
Hi blankx4729, thank you for the thorough debugging. Running ptxas spill analysis and tracing the kang_type bug down to its root cause is real engineering work. I went through your findings and the source carefully. Quick clarifications, then a redirect:
Bug 1 (GroupCnt=64 in GpuKang.cpp): not a bug in the stock build. Both GpuKang.cpp and defs.h hardcode 64 for OLD_GPU — they match. Your kang_type=0 issue appeared because you modified defs.h to PNT=16 but the host code was still 64, causing the mismatch. That said, you exposed a real maintainability issue: the constant shouldn't be duplicated. I'll fix GpuKang.cpp to read PNT_GROUP_CNT from defs.h directly, which eliminates this whole class of mistake.
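A minimal sketch of the "single source of truth" idea, for illustration only. The struct LaunchParams, the helpers make_launch_params and kang_count, and the kangaroo-count formula are hypothetical and are not taken from the PSCKangaroo source; only the name PNT_GROUP_CNT and the values 64 and 24 come from this thread. The real defs.h may well use a macro rather than a constexpr.

```cpp
#include <cstdint>

// Stand-in for defs.h: the only place the group count is written down.
// 64 (OLD_GPU) and 24 (otherwise) are the values mentioned in this thread.
#define OLD_GPU  // assumption for this sketch: build targets an older GPU
#ifdef OLD_GPU
constexpr std::uint32_t PNT_GROUP_CNT = 64;
#else
constexpr std::uint32_t PNT_GROUP_CNT = 24;
#endif

// Hypothetical host-side launch parameters (not the real project struct).
struct LaunchParams {
    std::uint32_t block_cnt;
    std::uint32_t block_size;
    std::uint32_t group_cnt;
};

// Stand-in for the host code: it never spells out 64 or 24 itself,
// it only reads the shared constant.
constexpr LaunchParams make_launch_params(std::uint32_t block_cnt,
                                          std::uint32_t block_size) {
    return LaunchParams{block_cnt, block_size, PNT_GROUP_CNT};
}

// Assumed formula for this sketch: one group of kangaroos per thread.
constexpr std::uint64_t kang_count(const LaunchParams& p) {
    return static_cast<std::uint64_t>(p.block_cnt) * p.block_size * p.group_cnt;
}

// Compile-time guard: a host/device mismatch fails the build
// instead of surfacing later as a runtime symptom.
static_assert(make_launch_params(1, 1).group_cnt == PNT_GROUP_CNT,
              "host and device group counts must match");
```

With this layout, changing the value in one place changes it everywhere, so an edit to defs.h cannot leave the host side behind.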
Bug 2 (defs.h hardcoded for OLD_GPU): intentional design. The 64 vs 24 split reflects empirical tuning for different GPU generations (the comment in defs.h lines 115-117 documents that 200 caused catastrophic spill on modern arch). Could be made configurable via a parallel V45_PNT_GROUP_CNT_OLD knob, but it's not wrong.
KernelB constants 128 and 8: your claim that they're "for PNT=64 only" is incorrect. They come from a fixed thread tile structure (8 kangs per sub-group, 32 threads per warp), not from PNT. The math actually works for PNT ∈ {16, 24, 32, ..., 64} — any multiple of 8 that's ≥ 16. Below 16, g8_ind degenerates to 0 always, which corrupts the walks — exactly what you observed at PNT=8. I verified the algebra for PNT=16, 24, and 64; all three produce valid kang_ind reconstructions.
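The constraint described above could be written down as a compile-time check, roughly like this. This is a sketch: is_valid_pnt and the constant names are hypothetical, and the upper bound of 64 is an assumption taken from the largest value mentioned in this post, not a limit confirmed in the source.

```cpp
// 8 kangs per sub-group, as stated in the post.
constexpr int kKangsPerSubGroup = 8;
// Below 16 the sub-group index degenerates, per the post.
constexpr int kMinPnt = 16;
// Assumption: 64 is the largest value discussed here, so cap there.
constexpr int kMaxPnt = 64;

// True when pnt is a multiple of 8 within [16, 64].
constexpr bool is_valid_pnt(int pnt) {
    return pnt >= kMinPnt && pnt <= kMaxPnt && pnt % kKangsPerSubGroup == 0;
}

// The three values the post reports as verified.
static_assert(is_valid_pnt(16), "PNT=16 should be accepted");
static_assert(is_valid_pnt(24), "PNT=24 should be accepted");
static_assert(is_valid_pnt(64), "PNT=64 should be accepted");
// The value reported to corrupt walks.
static_assert(!is_valid_pnt(8), "PNT=8 should be rejected");
```

Placed next to the definition of the group count, a check like static_assert(is_valid_pnt(PNT_GROUP_CNT), "...") would turn an unsupported value into a build error rather than a silently corrupted run.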
Your four questions, briefly:
1. Lower PNT → higher speed: correct analysis. Less spill, more active warps per SM.
2. Real or misleading? Real for PNT ≥ 16 and multiple of 8. Misleading for PNT < 16 — faster but corrupted walks.
3. PNT for puzzle 135 with limited RAM: minor misunderstanding. PNT affects GPU register pressure only. The TAME table (your -ramlimit) lives in CPU RAM — PNT doesn't touch it. Different bottlenecks.
4. KernelB generalization: doable via templating on PNT_GROUP_CNT with compile-time tile constants. Let's discuss the design on GitHub.
Your puzzle 135 stats: dropped=0 is healthy. 955 pending is normal for that throughput, not a sign of dp being too low. dp=15 is reasonable. Worth knowing though: puzzle 135 on a single 1060 is ~27,000 years expected. The math really doesn't favor 1060-class hardware — better to validate your mods against puzzle 65 or 70 where solves complete in minutes/hours.
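For anyone who wants to redo that kind of estimate for their own card, here is a back-of-the-envelope sketch. It assumes the usual kangaroo cost model of K * sqrt(interval width) group operations, where the interval for an n-bit puzzle is 2^(n-1) keys wide. K and the speed are inputs you have to supply; the values 1.15 and 200 MKeys/s used in the check at the bottom are illustrative assumptions, not numbers taken from this post. Function names are hypothetical.

```cpp
// Seconds in a 365.25-day year.
constexpr double kSecondsPerYear = 365.25 * 24.0 * 3600.0;
constexpr double kSqrt2 = 1.4142135623730951;

// 2^e for small non-negative e, usable at compile time (C++14 or later).
constexpr double pow2(int e) {
    double r = 1.0;
    for (int i = 0; i < e; ++i) r *= 2.0;
    return r;
}

// Expected group operations: k * sqrt(2^(puzzle_bits - 1)).
// k depends on the kangaroo variant and is an input, not a fixed fact.
constexpr double expected_ops(int puzzle_bits, double k) {
    const int width_bits = puzzle_bits - 1;
    const double root =
        pow2(width_bits / 2) * ((width_bits % 2) ? kSqrt2 : 1.0);
    return k * root;
}

// Expected wall-clock years at a steady rate of ops_per_sec.
constexpr double expected_years(int puzzle_bits, double k, double ops_per_sec) {
    return expected_ops(puzzle_bits, k) / ops_per_sec / kSecondsPerYear;
}

// Illustrative inputs (assumptions): K = 1.15, 200 MKeys/s.
// The result lands in the same ballpark as the figure quoted above.
static_assert(expected_years(135, 1.15, 2.0e8) > 26000.0 &&
              expected_years(135, 1.15, 2.0e8) < 28000.0,
              "sanity check on the order of magnitude");
```

Plugging in 65 or 70 for puzzle_bits with your measured speed gives a quick sense of how long a validation run should take before you commit hardware to it.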
Next steps: this is too technical for BitcoinTalk. Could you open an issue at github.com/pscamillo/PSCKangaroo with your ptxas output, exact diff, and hardware details? I'd like to ship the GpuKang.cpp fix plus a constraint comment for KernelB, and we can discuss the generalization there. PRs welcome if you want to contribute.
DID YOU TEST YOUR SOFTWARE AGAINST PUZZLES 130 OR 125?



