This is specifically a post regarding the programming of miners, but I'll share it since I think there's a good programming lesson to be learned here. Last summer I beat the
top public Monero miner by about 15% with a C# miner (ie my miner program ran about 15% faster than the top public one). If you're thinking "How the hell can you beat an optimized C miner with a C# one?" then the answer is a nice example of Knuth's "premature optimization is the root of all evil".
C is what is called a low-productivity high-speed language, C# is a high-productivity low-speed one. Productivity is how quickly/easily you can program something and speed is how fast it runs, so using C# you'd be able to quickly write a functioning program but it would run slow and C would be the opposite.
At first I quickly wrote a fully functional miner from scratch using C#, which took me about a day (C# is really high-productivity) but it was slow at about 50% of the speed of the top public one. I hadn't optimized anything yet at that point, all of the hashing functions were just basic implementations, not even optimized a little. So I ran a profiler and found that almost all the time was spent in the core mining loop.
The core loop uses AES encryption a lot for which a hardware instruction exists (aes-ni). So I programmed that part (about 100 lines of code) in C, to take advantage of the hardware instruction, and called it as a dll from my C# miner. This put me at the same speed as the top miner, even though I'd only written that specific function in C whereas the rest of my code (including the hashing functions Keccak, Blake, Groestl, etc) was still completely unoptimized C#.
I then profiled some more and found that the core loop was still by far the most time-consuming part of the code, so I looked into it some more. The loop can be split into 3 parts: explode_scratchpad, main_loop, implode_scratchpad. The middle one (main_loop) was limited by memory access latency so there was nothing that could be done there. The first and third parts (explode_scratchpad and implode_scratchpad) however weren't so I took a deeper look into them.
The algorithm for both of these constitutes keeping 8 128bit variables around and having 10 128bit keys. Then at each iteration each of the variables is AESed with each key and the result is successively written to or read from memory (a 2mb scratchpad being filled in the case of explode_scratchpad and being read in the case of implode_scratchpad). We have 8 128bit registers available (XMM0-7) and the aes-ni instruction takes a register with a variable, another register with the key, and encrypts the variable with the key.
Since we only have 8 registers the compiler-outputted assembly was constantly switching registers around to fit the keys. As it turns out there is a lesser-known variant of the aes-ni instruction which takes a register with a variable and a stack pointer to the key, apparently the compiler wasn't aware of this variant. So I wrote those 2 functions in assembly so that the 8 variables were kept in the 8 registers and the 10 keys were kept in an array on the stack and then I could iterate though the 10 keys and at each step AES the 8 registers in parallel without ever needing to switch registers. This then put me at about 115% of the speed of the top public miner.
The cool thing here is that this entire thing took me about a day, starting from scratch and ending up with a C# program with about 50 lines of handwritten assembly and about 50 lines of C thrown in as a dll. Everything else (including the Keccak, Blake, etc hashing functions) were still completely unoptimized C#. If you however take a look at the commits to the top public miner you'll see lots of things like "optimized Blake hashing" etc, ie lots of effort being put into optimizing irrelevancies, and they've been at this for years and as far as I can see they still haven't opened the Intel Instruction Set to find that there is an interesting variant of the aes-ni instruction that can be exploited to actually give a significant speed boost.
Anyway, this experience convinced me of my approach to programming, which is to first write a program using a high-productivity language and only then optimize the crucial parts/functions with a low-level programming language. This way you get the best of both worlds, high productivity for almost all of the code and high speed for the small part where it actually matters. It also stops you from putting effort into optimizing irrelevancies. Seeing as I significantly beat the top Monero miner in one day from scratch (and it was even my first time writing assembly or using hardware instruction sets) I'll consider it a vindication of my approach
Donald Knuth said:
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.