How come GCC with "-mtune=nocona" produces the fastest code? (for Zen3 CPU)

La Mancha

New Coder
Hello,

I have a program written in C that is compiled with GCC on Linux, BSD and Solaris. I want the binary to run on all "x64" processors, so I use the -march=x86-64 option. I have also been experimenting with different -mtune options. My machine is an AMD Ryzen 9 5950X, so I would expect -mtune=znver3 to produce the fastest code for my CPU. However, an exhaustive test of all available -mtune values showed that there is hardly any difference between -mtune=generic and any of the other values. For some reason, the only exception is -mtune=nocona, which clearly produces the fastest code!

The full result is attached to this post. The resulting runtime of the program, for each -mtune value, is given in seconds. Lower (faster) is better.

(tested with GCC version 12.1.0 on Debian Linux)
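
A test like this can be sketched roughly as follows. This is only a minimal sketch, not the script behind the attached results: the source file name `prog.c`, the `-O2` level, the output naming and all helper names are assumptions for illustration. Only `-march=x86-64` and the `-mtune` values come from the post.

```python
import statistics
import subprocess
import time


def build_command(tune, source="prog.c", opt="-O2"):
    """Return the gcc command line for one -mtune value.

    `source` and `opt` are hypothetical placeholders, not from the post.
    """
    return ["gcc", opt, "-march=x86-64", f"-mtune={tune}",
            source, "-o", f"prog-{tune}"]


def time_binary(path, runs=5):
    """Run a binary `runs` times and return the median wall-clock seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([path], check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def relative_to_baseline(results, baseline="generic"):
    """Express each runtime as a ratio of the baseline (below 1.0 is faster)."""
    base = results[baseline]
    return {tune: seconds / base for tune, seconds in results.items()}


def benchmark(tunes):
    """Build and time every tuning; requires gcc and the source file."""
    results = {}
    for tune in tunes:
        subprocess.run(build_command(tune), check=True)
        results[tune] = time_binary(f"./prog-{tune}")
    return relative_to_baseline(results)
```

Calling `benchmark(["generic", "znver3", "nocona"])` would then report each runtime as a fraction of the `generic` build. Taking the median of several runs helps rule out noise from frequency scaling or background load.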

How come "nocona", of all things, produces faster code? That is some old "improved version of Intel Pentium 4 CPU", very different from my Zen3 :thinking:

Thank you!
 

Attachments

  • gcc-mtune.pdf

HadASpook

King Coder
Nope, I can't help with this one. Haven't programmed in ages, meaning that I'm quite rusty with C/GCC tools.

I found your question here on StackOverflow while searching for mtune and nocona. Appears it was closed because it "needs debugging details".

I like this user's comment (particularly the part in bold):
I do not understand what kind of answer you expect to get. How comes GCC produces fastest code with "-mtune=nocona"? Why not? Different options affect the compiler which makes different decisions, which result in a different executable which may result in a faster executable. I do not understand what other answer you would want to get, without sharing the source code and the measurement method.

They also all seem to be looking for a reproducible example, because based on that demand, nobody seems to be able to help you.
 

La Mancha

New Coder
HadASpook said:
I like this user's comment (particularly the part in bold):
I do not understand what kind of answer you expect to get. How comes GCC produces fastest code with "-mtune=nocona"? Why not? Different options affect the compiler which makes different decisions, which result in a different executable which may result in a faster executable. I do not understand what other answer you would want to get, without sharing the source code and the measurement method.

I understand that different "-mtune" options cause the compiler to make different decisions during code optimization.

But the "-mtune" option that matches the actual CPU model is supposed to generate the fastest code. That is the whole idea of making CPU-model-specific optimizations, right? In the case of my Zen 3 (Ryzen 9 5950X) CPU, this clearly would be the "-mtune=znver3" option!

Now, the really weird thing is that all "-mtune" options available in GCC 12 (I have tested them all) have either zero or negligible effect compared to the default "-mtune=generic" option. The one and only exception is "-mtune=nocona": it is the only "-mtune" option that makes a difference, and it produces significantly faster code than all the others. That is quite surprising to me o_O

(Note that "nocona" is some old Pentium 4 CPU model, very different from my Zen 3 CPU. Meanwhile, the "pentium4" option has no effect!)

I could accept that none of the optimization options makes a difference for my program. But why does exactly one optimization option, and one that appears to be a totally random choice, provide a huge speed-up? This seems more like a "bug" or "undefined behavior" than a "feature".
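
One way to narrow this down is to check what the two tunings actually change. The following is a minimal sketch, assuming `gcc` is on the PATH: it asks GCC to print its target-specific option settings with `-Q --help=target` for two `-mtune` values and lists the lines that differ. The helper names are hypothetical. Not every internal tuning heuristic shows up in this listing, so an empty result does not prove the generated code is identical; comparing the `-S` assembly output of a hot function is the more direct check.

```python
import subprocess


def target_options(tune):
    """Return GCC's target option report for one -mtune value (needs gcc)."""
    cmd = ["gcc", "-Q", "--help=target", "-march=x86-64", f"-mtune={tune}"]
    return subprocess.run(cmd, check=True, capture_output=True,
                          text=True).stdout


def changed_lines(report_a, report_b):
    """Return only the lines that appear in one report but not the other."""
    a = report_a.splitlines()
    b = report_b.splitlines()
    only_a = [f"- {line.strip()}" for line in a if line not in b]
    only_b = [f"+ {line.strip()}" for line in b if line not in a]
    return only_a + only_b
```

For example, `changed_lines(target_options("nocona"), target_options("znver3"))` would list the settings that differ between the two tunings.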

HadASpook said:
They also all seem to be looking for a reproducible example, because based on that demand, nobody seems to be able to help you.

The "issue" was observed with a rather complex software project, which I cannot put up publicly. Sorry.

Also, just picking a small "example" from the code will not show the very specific behavior that was observed with the actual (complete) program.

So, while I generally understand the demand for "reproducible example", it does not work like that here.

HadASpook said:
Appears it was closed because it "needs debugging details".

Moderators at Stack Overflow are not very supportive. They just closed my topic, even though there were some helpful comments :rolleyes:

Someone had suggested I should try profile-guided optimization (-fprofile-generate, then -fprofile-use). I tried, but that didn't make much of a difference either...
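
For reference, GCC's profile-guided optimization is a three-step cycle: build with `-fprofile-generate`, run a representative workload, then rebuild with `-fprofile-use`. Below is a minimal sketch that only assembles the command lines; `prog.c`, `./prog`, `-O2` and the helper name are assumptions, not details of the actual project.

```python
def pgo_steps(source="prog.c", binary="./prog", tune="generic"):
    """Return the three command lines of a GCC PGO cycle, in order.

    All default arguments are hypothetical placeholders.
    """
    common = ["gcc", "-O2", "-march=x86-64", f"-mtune={tune}",
              source, "-o", binary]
    return [
        common + ["-fprofile-generate"],  # 1. instrumented build
        [binary],                         # 2. training run records the profile
        common + ["-fprofile-use"],       # 3. rebuild using the recorded profile
    ]
```

Each step can be run in order with `subprocess.run(step, check=True)`. The training run in step 2 should exercise the same code paths as the benchmark, otherwise the recorded profile says little about the hot loops being measured.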
 

HadASpook

King Coder
Well, I'm sorry but I can't be of much help to you, because I have never worked with this part of GCC.

I do agree though that the option for a completely different CPU giving you better optimisation than the one for your own CPU is quite strange. I quickly checked the GCC manual for these options and they provide no information on why this is the case, other than what CPU/architecture they are for.

I don't know if you've read over it already or if it will be of much use to you, but I found this Gentoo wiki page which provides some useful optimisation details. Hope it helps!
 