Generate instructions for the machine type cpu-type. The value native selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine. When used with -march, the Pentium Pro instruction set is used, so the code runs on all i686-family chips. Used by Centrino notebooks. No scheduling is implemented for this chip. The related -mtune=cpu-type option tunes everything applicable about the generated code to cpu-type, except for the ABI and the set of available instructions.

The choices for cpu-type are the same as for -march; in addition, -mtune supports two extra choices for cpu-type, generic and intel. If you do not know exactly what CPU users of your application will have, then you should use this option. As new processors are deployed in the marketplace, the behavior of this option will change. Therefore, if you upgrade to a newer version of GCC, code generation controlled by this option will change to reflect the processors that are most common at the time that version of GCC is released.
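
As a rough illustration of the difference on x86 (the flag values below are examples, not recommendations), the same trivial program can be built two ways:

    /* march_vs_mtune.c -- illustrative builds (example flag values):
     *
     *   gcc -O2 -march=skylake march_vs_mtune.c
     *     may use AVX2 and other Skylake-only instructions, so the binary
     *     can die with "illegal instruction" on older CPUs.
     *
     *   gcc -O2 -march=x86-64 -mtune=skylake march_vs_mtune.c
     *     keeps to the baseline x86-64 instruction set but schedules code
     *     with Skylake in mind, so it still runs on any x86-64 CPU.
     */
    #include <stdio.h>

    int main(void) {
        puts("same source, different code generation");
        return 0;
    }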

In contrast, -mtune indicates the processor, or in this case the collection of processors, for which the code is optimized. The intel value produces code optimized for the most current Intel processors, which are Haswell and Silvermont for this version of GCC. If you want your application to perform better on both Haswell and Silvermont, then you should use this option.

As new Intel processors are deployed in the marketplace, the behavior of this option will change. Therefore, if you upgrade to a newer version of GCC, code generation controlled by this option will change to reflect the most current Intel processors at the time that version of GCC is released. A related option, -mfpmath=unit, generates floating-point arithmetic for the selected unit.

The main choices for unit are 387 and sse (a combined mode is described below). The 387 choice uses the standard 387 floating-point coprocessor present on the majority of chips and emulated otherwise; code compiled with this option runs almost everywhere. The temporary results are computed in 80-bit precision instead of the precision specified by the type, resulting in slightly different results compared to most other chips (see -ffloat-store for a more detailed description). The sse choice uses scalar floating-point instructions present in the SSE instruction set.

The earlier version of the SSE instruction set supports only single-precision arithmetic, thus the double and extended-precision arithmetic are still done using 387. A later version, present only in Pentium 4 and AMD x86-64 chips, supports double-precision arithmetic too.

For the x86-64 compiler, these extensions are enabled by default. The resulting code should be considerably faster in the majority of cases and avoid the numerical instability problems of 387 code, but may break some existing code that expects temporaries to be 80 bits.
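
A small, hedged illustration of the 80-bit temporary issue; the exact output depends on the compiler, the optimization level, and the -mfpmath setting:

    /* On x86-32 with -mfpmath=387, the right-hand side of the comparison may be
     * kept in an 80-bit x87 register while q has already been rounded to a
     * 64-bit double, so the two can compare unequal; with -mfpmath=sse (or with
     * -ffloat-store) they normally compare equal. */
    #include <stdio.h>

    int main(void) {
        volatile double a = 1.0, b = 3.0;   /* volatile keeps the division from being folded */
        double q = a / b;
        printf("%s\n", (q == a / b) ? "equal" : "different");
        return 0;
    }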

The sse choice is the default for the x86-64 compiler, Darwin x86-32 targets, and the default choice for x86-32 targets with the SSE2 instruction set when -ffast-math is enabled. The combined choice attempts to utilize both instruction sets at once. This effectively doubles the amount of available registers, and on chips with separate execution units for 387 and SSE, the execution resources too.

Use this option with care, as it is still experimental, because the GCC register allocator does not model separate functional units well, resulting in unstable performance.

The -masm=dialect option outputs assembly instructions using the selected dialect. It also affects which dialect is used for basic asm (see Basic Asm) and extended asm (see Extended Asm).

This guide also describes the theory behind optimizing in general. While these variables (such as CFLAGS and CXXFLAGS) are not standardized, their use is essentially ubiquitous, and any correctly written build system should understand them for passing extra or custom options when it invokes the compiler.

See the GNU make info page for a list of some of the commonly used variables in this category. They can be used to decrease the amount of debug messages for a program, increase error warning levels and, of course, to optimize the code produced.

The GCC manual maintains a complete list of available options and their purposes. Variables set in Portage's make.conf file will be exported to the environment of programs invoked by Portage, such that all packages will be compiled using these options as a base. Almost every system should be configured in this manner. Don't set them arbitrarily. Individual packages further modify these options, either in the ebuild or the build system itself, to generate the final set of flags used when invoking the compiler.

Being aware of the risks involved, take a look at some sane, safe optimizations. These will stand you in good stead and will endear you to developers the next time a problem is reported on Bugzilla.

Remember: aggressive flags can ruin code! Sometimes such flags are mutually exclusive or counterproductive, so this guide will stick to combinations known to work well.

Compiler flags across architectures: -march, -mtune, and -mcpu

Ideally, these are the best flags available for any CPU architecture. For informational purposes, aggressive flag use will be covered later. Not every option listed in the GCC manual (there are hundreds) will be discussed, but the most common basic flags will be reviewed. The first and most important option is -march. This tells the compiler what code it should produce for the system's processor architecture (or arch); it tells GCC that it should produce code for a certain kind of CPU.

Different CPUs have different capabilities, support different instruction sets, and have different ways of executing code. The -march flag will instruct the compiler to produce specific code for the system's CPU, with all its capabilities, features, instruction sets, quirks, and so on provided the source code is prepared to use them.

For instance, to benefit from AVX instructions, the source code needs to be adapted to support them. The reason it isn't enabled at -O2 is that it doesn't always improve code: it can make code slower as well, and it usually makes the code larger; it really depends on the loop, etc. To get more details, including the march and mtune values GCC chooses, two commands can be used.
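
A minimal sketch of how the chosen -march value is visible to the source code (the macros below are the standard GCC/Clang feature macros):

    /* Build this twice, e.g. with -march=x86-64 and with -march=native on an
     * AVX-capable machine, and compare the output: -march decides which
     * instruction-set extensions (and their macros) the compiler may use. */
    #include <stdio.h>

    int main(void) {
    #ifdef __AVX__
        puts("__AVX__ defined: the compiler may emit AVX instructions");
    #else
        puts("__AVX__ not defined: AVX code generation is off");
    #endif
    #ifdef __SSE4_2__
        puts("__SSE4_2__ defined");
    #endif
        return 0;
    }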

When -march=native is used, GCC will attempt to detect the processor and automatically set appropriate flags for it.

However, this should not be used when intending to compile packages for different CPUs! Also available are the -mtune and -mcpu flags. These flags are normally only used when there is no available -march option; certain processor architectures may require -mtune or even -mcpu. Unfortunately, GCC isn't very consistent in how each flag behaves from one architecture to the next.

Consider using -mtune when generating code for older CPUs such as i386 and i486. Do not use -mcpu on x86 or x86-64 systems, as it is deprecated for those arches. Again, GCC's behavior and flag naming are not consistent across architectures, so be sure to check the GCC manual to determine which one should be used. Next up is the -O flag. This controls the overall level of optimization. Changing this value will make compilation take more time and use much more memory, especially as the level of optimization is increased.

From v8.4 of the Arm architecture onwards, suitably aligned LDP/STP accesses are guaranteed to be single-copy atomic (the FEAT_LSE2 guarantee), so they can be used to implement 128-bit atomic loads and stores.

Additionally, some earlier CPUs had this property anyway, so it makes sense to use it. So this extends the selection to volatile accesses, even if they're not aligned (presumably the coder knows what they're doing).

The one exception for volatile is when -mstrict-align is in force; that should take precedence.

Hi Tim, thanks for looking into this optimization opportunity. I have a few remarks regarding this change. There are two problematic cases as far as I understand: (1) const and (2) volatile atomic objects. Const objects disallow write access to the underlying memory, while volatile objects mandate that each byte of the underlying memory shall be accessed exactly once according to the AAPCS.
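
For illustration, here is a hedged sketch of the const case; the struct and function names below are made up:

    /* If a 16-byte atomic load is lowered to an LDXP/STXP (compare-and-swap
     * style) loop, the loop stores back to the object, which is invalid for an
     * object in read-only memory, and it may touch the bytes of a volatile
     * object more than once. */
    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } pair128;

    pair128 load_pair(const _Atomic pair128 *p) {
        return atomic_load(p);   /* a CAS-loop lowering would write to *p */
    }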

The CAS loop violates both. Maybe the solution is to follow GCC here, at least for the general case (architectures prior to v8.4). In that case we need to provide an implementation of those functions. I think Clang is involved there too, in horribly non-obvious ways; for example, I think that's the only way to get the actual libcalls you want rather than legacy ones.

Either way, that's a change that would need pretty careful coordination. Since all of our CPUs are Cyclone or above, we could probably just skip the libcalls entirely at Apple without ABI breakage, which, unintentionally, is what this patch does.

Is there such a guarantee to the compiler?

Maybe that's the default for operating systems but not for bare-metal? Would it make sense to enable this optimization only for certain target triples?

I don't think anyone has written down a guarantee, but we've pretty much always assumed we're accessing reasonably normal memory. I've never had any comments from our more embedded developers on that front, or seen anyone try to do general atomics in another realm. I suspect they go to assembly for the few spots it might matter.

In certain code sequences where you have two consecutive atomic stores, or an atomic load followed by an atomic store, you'll end up with redundant memory barriers. Is there a way to get rid of them?

ARM has a pass designed to merge adjacent barriers, though I've seen it miss some cases. We might think about porting it to AArch64, or maybe doing some work in AtomicExpansion in a generic way.
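
As a hedged sketch of the source pattern being described (variable names are arbitrary):

    /* Two back-to-back sequentially consistent stores; depending on how the
     * target lowers them, the barrier emitted between them can be redundant
     * when no other memory access intervenes. */
    #include <stdatomic.h>

    atomic_int data;
    atomic_int ready;

    void publish(int v) {
        atomic_store(&data, v);    /* seq_cst store */
        atomic_store(&ready, 1);   /* adjacent barrier may be mergeable */
    }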

Enabling the corresponding subtarget feature on cyclone doesn't seem safe to me. I don't think Clang really does anything with -mtune yet. Almost all of the features in the list are going to break older CPUs.

I'm afraid I mentioned to Alexandros that I wondered how this would interact with a potential future enabling of an -mtune feature, leading to the above question. But you're right: if we do end up implementing support for -mtune, we'd need to categorize subtarget features into either architectural or tuning features, and only enable the tuning features for a subtarget when -mtune is used.

Flags controlling the behavior of Clang during compilation. These flags have no effect during actions that do not perform compilation. Require member pointer base types to be complete if they would be significant under the Microsoft ABI. Disable auto-generation of preprocessed source files and a script for reproduction during a clang crash. Enable ODR indicator globals to avoid false ODR violation reports in partially sanitized programs at the cost of an increase in binary size.

Strip (or keep only, if negative) a given number of path components when emitting check metadata. Turn on runtime checks for various forms of undefined or suspicious behavior. See the user manual for available checks. Print supported cpu models for the given target (if the target is not specified, it will print the supported cpus for the default target).

Flags controlling how #includes are resolved to files. Restrict all prior -I flags to double-quoted inclusion and remove the current directory from the include path. Specify the mapping of module name to precompiled module file, or load a module file if the name is omitted. Flags controlling generation of a dependency file for make-like build systems.

Flags controlling which warnings, errors, and remarks Clang will generate. See the full list of warning and remark flags. Report transformation analysis from optimization passes whose name matches the given POSIX regular expression. Report missed transformations by optimization passes whose name matches the given POSIX regular expression.

Report transformations performed by optimization passes whose name matches the given POSIX regular expression. Generate instrumented code to collect context-sensitive execution counts into default.profraw.

Form fused FP ops (e.g. FMAs). Only include passes which match a specified regular expression in the generated optimization record (by default, include all passes). Generate instrumented code to collect an order file into default.profraw. Generate instrumented code to collect execution counts into default.profraw. Use instrumentation data for profile-guided optimization. Allocate to an enum type only as many bytes as it needs for the declared range of possible values.

Enable stack protectors for some functions vulnerable to stack smashing. This uses a loose heuristic which considers functions vulnerable if they contain a char (or 8-bit integer) array or constant-sized calls to alloca which are of greater size than ssp-buffer-size (default: 8 bytes). All variable-sized calls to alloca are considered vulnerable. Compared to -fstack-protector, this uses a stronger heuristic that includes functions containing arrays of any size and any type, as well as any calls to alloca or the taking of an address from a local variable.
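
A small example of the kind of function the heuristic described above selects (assuming the default ssp-buffer-size of 8):

    /* This function contains a char array larger than 8 bytes, so it receives
     * a stack canary under -fstack-protector; under -fstack-protector-strong
     * even smaller arrays or address-taken locals would qualify. */
    #include <string.h>

    void copy_name(char *dst, const char *src) {
        char buf[32];                          /* > ssp-buffer-size, triggers protection */
        strncpy(buf, src, sizeof buf - 1);
        buf[sizeof buf - 1] = '\0';
        strcpy(dst, buf);
    }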

Select which XRay instrumentation points to emit. Options: all, none, function, custom. OpenCL only: specify that single-precision floating-point divide and sqrt used in the program source are correctly rounded.

Defines that the global work-size be a multiple of the work-group size specified to clEnqueueNDRangeKernel. Allow unsafe floating-point optimizations. May be specified more than once. Enable builtin include directories even when -nostdinc is used before or after -ibuiltininc; using -nobuiltininc after the option disables it.

Directly create compilation output files. This may lead to incorrect incremental builds if the compiler crashes.

Flags such as -march, -mtune, and -mcpu control binary code generation, so the correct use of these flags can dramatically improve runtime performance.

What exactly do these flags do? Do they have the same meaning when compiling for Arm as when compiling for x86? Do they mean the same thing to all compilers? How should you use them to get the best performance for your application? For GCC and Clang, the -march flag specifies the target architecture. The -mtune flag specifies the target microarchitecture. The -mtune flag does not enable the compiler to use the special hardware features of the target. It only advises the compiler to perform architecture-independent optimizations like instruction reordering.

This is a crucial difference between Arm and x86!

Figure 1: Architecture vs. Microarchitecture in the Arm Ecosystem.

If you plot some Arm architecture specifications against the microarchitectures that implement them, the graph axes are somewhat conflated, since architectures and microarchitectures are closely linked; the blue horizontal lines show the baseline architecture for each microarchitecture on the vertical axis. For now, just focus on the idea that each target has an architecture and a microarchitecture.

You may notice that many of the targets in the example binary's execution space do not actually exist. However, if such a target ever did exist, then we know for certain that this example binary could execute on it. The optimization space only shows the targets for which the compiler may have performed optimizations. The -mtune flag advises the compiler to optimize for a target microarchitecture, but only for a generic instruction set.

On Arm, if you want to optimize for both a particular architecture and microarchitecture then you use the -mcpu flag. The -mcpu flag accepts the same parameter values as the -mtune flag.

In this case, the binary could execute on anything implementing that v8 baseline architecture. What happens when -march, -mtune, and -mcpu are used in combination? On Arm, the -march and -mtune flags override any value passed to -mcpu. Fortunately, the GNU compiler will issue a warning in this case. Another difference between Arm and x86 is that the -march and -mtune flags are entirely orthogonal on Arm.

Mix and match freely!

The resulting binary will execute on architecture X and all supersets of architecture X, but will be optimized for microarchitecture Y.

The binary would have execution and optimization spaces as shown in Figure 5. So why have the -mcpu flag at all if -mcpu is just an alias for -mtune on x86, and -march and -mtune are orthogonal on Arm? Why not just combine -march and -mtune as needed on Arm, or follow the x86 convention and let -march imply -mtune?

In reality, CPU architects frequently add extensions from multiple Arm architectures to the baseline, both above and below the baseline architecture version. The Arm Neoverse N1 is a perfect example of how targets typically have a complete implementation of one architecture but support features from other architectures as well.

On the ThunderX2, for example, any instruction from its v8.1 baseline architecture is available. When you specify -march, you are confining the compiler to only the baseline architecture, so the compiler is unable to take advantage of any architecture extensions beyond the baseline. In order to take advantage of all the features of a particular target, you should use the -mcpu flag to simultaneously specify the architecture, with all its extensions, and the microarchitecture. The code is shown in Figure 6.

Armv8.1 introduced the Large System Extensions (LSE), a set of atomic instructions. Real-world application speed-ups of 10x and even more have been reported when using LSE, so if our target supports LSE then we would very much like to use LSE instructions.
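
A hedged sketch of what that means for ordinary C11 atomics (the flag values are examples):

    /* With something like -mcpu=neoverse-n1 or -march=armv8-a+lse, the
     * fetch-and-add below can compile to a single LSE instruction (LDADD);
     * with a plain -march=armv8-a baseline it becomes an LDXR/STXR retry loop. */
    #include <stdatomic.h>

    long counter_add(atomic_long *c, long v) {
        return atomic_fetch_add(c, v);    /* returns the previous value */
    }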

It is more complicated than I thought: -mtune, -march in GCC

In C, you compile for some architecture. Thus you have to tell the compiler what kind of machine you have. In theory, you could recompile all the code for the exact machine you have, but it is slow and error prone. So we rely on prebuilt binaries, typically, and these are often not tuned specifically for our hardware. It is possible for a program to detect the hardware it is running on and automatically adapt (e.g., by runtime dispatch). So GCC and Clang have flags that allow you to tell them what kind of hardware you have. I thought I understood them, until now. This is somewhat ambiguous and will strictly depend on the compiler version you are using, as new processors being released might change this tuning.
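
A hedged sketch of the runtime-detection idea mentioned above, using the GCC/Clang x86 builtins (the kernel functions are placeholders):

    #include <stdio.h>

    static void kernel_avx2(void)     { puts("AVX2 implementation"); }
    static void kernel_portable(void) { puts("portable implementation"); }

    int main(void) {
        __builtin_cpu_init();                  /* initialise CPU feature detection */
        if (__builtin_cpu_supports("avx2"))
            kernel_avx2();                     /* chosen only when the running CPU has AVX2 */
        else
            kernel_portable();
        return 0;
    }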

And that could almost be inferred from the documentation. Let us check using funny command lines. It is important to note that this compiler predates Skylake processors.

What you care about is whether it produces different binaries. Does it? Unfortunately yes, it does. See my code sample. Does it matter in practice, as far as performance goes? Probably not in actual systems, but if you are doing microbenchmarking or studying a specific function, small differences might matter.

If you can replicate it with gcc, maybe it depends on your operating system and GCC version. On CentOS 7.