From a pedagogical point of view, yes; but from a competitive point of view, increasing your chances requires a huge financial investment, like RC or others. Most people here talk about top speeds, kernels, and achievements without publishing code, which makes me think they are still stuck in the competitive trap of blind hope rather than in sharing their ideas.
Let's be honest: if you use farms, sharing fast code puts you at a disadvantage. But if you're here purely for leisure, I don't see the point in not sharing the code behind those supposed "achievements".
True, very true. To increase the chance, you need either a huge financial investment in computing power, a mathematical or algorithmic advantage (which still requires a huge financial investment in computing power), or pure GPU luck, which is unlikely by orders of magnitude, like throwing a dart from the Moon and aiming to hit a grain of sand on Earth. But anyway, we all know that.
In my case, the code I am working on is not ready for publishing, and I am not confident enough to publish it. No matter how many tests I run, I am never sure it is good and correct. Even now, when it passes every test I can think of, I keep wondering, "what if that, when this, will it still give the expected result?" So for now, it is only for personal use and development. Also, I am new to all of this; I would not even know where to begin.

It would also take a lot of work to make the code widely usable rather than architecture- and CUDA-version-specific. What runs great on one GPU architecture does not necessarily run best on another, and code using a newer CUDA toolkit will not compile on older ones. I am working with CUDA 13 and using features introduced in it, like "enable_smem_spilling" (which is actually useless for me, since on my compute capability 8.6 card the kernel compiles to 122 registers used and 0 spills with everything fully inlined, but it is an example of how newer CUDA features would require rewriting to compile under older versions). Or I could simply publish it as "Ampere-specific, CUDA 13 - this is the performance." I would also need access to a lot of different GPUs, which I cannot afford at the moment, so I am stuck with compute capability 8.6, more specifically a 3070 Ti.
So I do not know, maybe one day. Or maybe I am just complaining too much? You are right about everything you mentioned. But I still enjoy coding, gaining knowledge, and watching those Mkey/s numbers.