
How come GCC produces the fastest code with "-mtune=nocona"? (on a Zen 3 CPU)

La Mancha

Hello,

I have a program written in C that I compile with GCC on Linux, BSD and Solaris. I want the binary to run on all "x64" processors, so I use the -march=x86-64 option. I was also experimenting with different -mtune options. My machine is an AMD Ryzen 9 5950X, so I would expect -mtune=znver3 to produce the fastest code for my CPU. However, an exhaustive test of all available -mtune values showed that there is hardly any difference between -mtune=generic and any of the other values. For some reason, the only exception is -mtune=nocona, which clearly produces the fastest code!

The full result is attached to this post. The resulting runtime of the program, for each -mtune value, is given in seconds. Lower (faster) is better.

(tested with GCC version 12.1.0 on Debian Linux)
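
For reference, the comparison loop looked roughly like the sketch below. Since I cannot share the project, "my_program.c" is a hypothetical stand-in for the real build:

    #!/bin/sh
    # Build one binary per -mtune value and time each run.
    # "my_program.c" is a stand-in for the actual project, and the
    # list of tunings is abridged; the real test covered every
    # -mtune value that GCC 12 accepts.
    for tune in generic znver3 nocona pentium4 skylake; do
        gcc -O2 -march=x86-64 -mtune="$tune" -o "prog-$tune" my_program.c
        printf '%-10s ' "$tune"
        /usr/bin/time -f '%e s' "./prog-$tune" > /dev/null
    done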

How come, of all things, "nocona" produces the fastest code? That is some old "improved version of the Intel Pentium 4" CPU, very different from my Zen 3 :thinking:

Thank you!
 

Attachments

  • gcc-mtune.pdf
305.5 KB
Nope, I can't help with this one. I haven't programmed in ages, so I'm quite rusty with C and the GCC toolchain.

I found your question here on Stack Overflow while searching about mtune and nocona. It appears it was closed because it "needs debugging details".

I like this user's comment (particularly the part in bold):
I do not understand what kind of answer you expect to get. How come GCC produces the fastest code with "-mtune=nocona"? Why not? Different options affect the compiler, which makes different decisions, which result in a different executable, which may be a faster executable. I do not understand what other answer you would want without sharing the source code and the measurement method.

They also all seem to be asking for a reproducible example, and without one nobody seems able to help you.
 
I like this user's comment (particularly the part in bold):
I do not understand what kind of answer you expect to get. How come GCC produces the fastest code with "-mtune=nocona"? Why not? Different options affect the compiler, which makes different decisions, which result in a different executable, which may be a faster executable. I do not understand what other answer you would want without sharing the source code and the measurement method.
I understand that different "-mtune" options cause the compiler to make different code-optimization decisions.

But the "-mtune" option that matches the actual CPU model is supposed to generate the fastest code. That is the whole idea of making CPU-model-specific optimizations, right? In the case of my Zen 3 (Ryzen 9 5950X) CPU, that would clearly be the "-mtune=znver3" option!

Now, the really weird thing is that all of the "-mtune" options available in GCC 12 (I have tested them all) have either zero or negligible effect compared to the default "-mtune=generic". The one and only exception is "-mtune=nocona". It is the only "-mtune" option that makes a difference, and it produces significantly faster code than all the others. That is quite surprising to me o_O

(Note that "nocona" is an old Pentium 4 CPU model, very different from my Zen 3 CPU. Meanwhile, the "pentium4" option has no effect!)

I could accept that none of the tuning options makes a difference for my program. But why does exactly one option – and one that appears to be a totally random choice – provide a huge speed-up? This seems more like a "bug" or "undefined behavior" than a "feature".
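
In case it helps anyone dig further: GCC can print the tuning and target parameters it actually applies, so the two tunings can be compared directly. These are standard GCC introspection options; this is roughly how I would diff them (note that not every tuning decision shows up here, since much of it lives in internal cost tables):

    # Dump the parameters in effect for each tuning, then diff them.
    gcc -O2 -mtune=nocona -Q --help=params > params-nocona.txt
    gcc -O2 -mtune=znver3 -Q --help=params > params-znver3.txt
    diff params-nocona.txt params-znver3.txt

    # The same comparison works for the target flags, and the
    # generated assembly of a hot file can be diffed via -S:
    gcc -O2 -mtune=nocona -Q --help=target > target-nocona.txt
    gcc -O2 -mtune=znver3 -Q --help=target > target-znver3.txt
    diff target-nocona.txt target-znver3.txt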

They also all seem to be asking for a reproducible example, and without one nobody seems able to help you.
The "issue" was observed with a rather complex software project, which I can not put up publicly. Sorry.

Also, just picking a small "example" from the code will not show the very specific behavior that was observed with the actual (complete) program.

So, while I generally understand the demand for "reproducible example", it does not work like that here.

It appears it was closed because it "needs debugging details".
The moderators at Stack Overflow are not very supportive. They just closed my topic, even though there were some helpful comments :rolleyes:

Someone had suggested I try profile-guided optimization (the -fprofile-generate / -fprofile-use cycle, sketched below). I tried, but that didn't make much of a difference either...
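
For completeness, this is the standard PGO sequence I followed:

    # 1) Build an instrumented binary ("my_program.c" again stands in
    #    for the real project).
    gcc -O2 -march=x86-64 -fprofile-generate -o prog my_program.c
    # 2) Run it on a representative workload; this writes *.gcda profile data.
    ./prog
    # 3) Rebuild using the collected profile.
    gcc -O2 -march=x86-64 -fprofile-use -o prog my_program.c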
 
Well, I'm sorry, but I can't be of much help to you, because I have never worked with this part of GCC.

I do agree, though, that it is quite strange for the option for a completely different CPU to give you better optimisation than the one for your own CPU. I quickly checked the GCC manual for these options, and it provides no information on why this might happen, other than which CPU/architecture each value targets.

I don't know if you've already read it, or whether it will be of much use to you, but I found this Gentoo wiki page, which provides some useful optimisation details. Hope it helps!
 
I'm not even close to providing a correct answer to this question 🤪 but I would guess the answer falls into one of two generic categories:
1. Accident (circumstance, luck) – either a lucky one, for example the code generation just happening to land "in the zone" for your CPU, or a less lucky one, for example finding a way for the CPU to ignore its thermal constraints.
2. Tradeoff – either a good one, where your code got squeezed into a shape for that particular old CPU that turns out to be a perfect fit for your actual CPU as well, letting it process more at the same time (occasionally less is more, even for CPU extensions), so you lucked out because what you gave up was never needed in the first place. A bad tradeoff, by contrast, would be something like reintroducing an old security hole (those often happen to make things faster).

The positive scenario has the potential to make modern computers run even faster! It is also probably not very likely.
 
P.S. This actually happens quite frequently: my GPU got a huge performance boost when I configured it as an older-generation card, and my CPU often reports itself as a different product, even from a different architecture, presumably as a cost-saving measure. That had no bad side effects, but I assume processor-specific optimization could misbehave because of it.

P.P.S. If I remember correctly (it was decades ago), CPU-specific optimizations decreasing performance instead of increasing it was the reason I stopped bothering with them. The word "fugazi" comes to mind :) outside of a few specific extensions, maybe.
 
