
LLVM Project News and Details from the Trenches


Tuesday, June 21, 2016

ThinLTO: Scalable and Incremental LTO

ThinLTO was first introduced at EuroLLVM in 2015, with results shown from a prototype implementation within clang and LLVM. Since then, the design has been reviewed through several RFCs, it has been implemented in LLVM (for gold and libLTO), and tuning is ongoing. Results already show good performance for a number of benchmarks, with compile time close to a non-LTO build.

This blog post covers the background, design, current status and usage information.

This post was written by Teresa Johnson, Mehdi Amini and David Li.
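
As a quick, hedged sketch of what usage looks like (the exact flags and linker setup are covered in the post; this assumes a clang and linker with ThinLTO support, e.g. gold with the LLVM plugin):

clang -flto=thin -O2 -c file1.c file2.c
clang -flto=thin -O2 file1.o file2.o -o program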

Friday, April 1, 2016

My Little LLVM: Undefined Behavior is Magic!

New LLVM logo: a horrible mashup between LLVM's old dragon logo and a My Little Pony inspired pegasus pony

There’s been lots of discussion online (and then quite some more) about compilers abusing undefined behavior. As a response, the LLVM compiler infrastructure is rebranding and adopting a motto to make undefined behavior friendlier and less prone to corruption.


The re-branding puts to rest a long-standing issue with LLVM’s “dragon” logo actually being a wyvern with an upside-down head, a special form of undefined behavior in its own right. The logo is now clearly a pegasus pony.


Another great side-effect of this rebranding is increased security by auto-magically closing all vulnerabilities used by the hacker who goes by the pseudonym “Pinkie Pie”.


These new features are enabled with the -rainbow clang option, in honor of Rainbow Dash’s unary name.

Monday, November 24, 2014

Loop Vectorization: Diagnostics and Control

Loop vectorization was first introduced in LLVM 3.2 and turned on by default in LLVM 3.3. It has been discussed previously on this blog in 2012 and 2013, as well as at FOSDEM 2014 and at Apple's WWDC 2013. The LLVM loop vectorizer combines multiple iterations of a loop to improve performance: modern processors can exploit the independence of the interleaved instructions using advanced hardware features, such as multiple execution units and out-of-order execution.

Unfortunately, when loop vectorization is not possible or profitable, the loop is silently skipped. This is a problem for many applications that rely on the performance vectorization provides. Recent updates to LLVM provide command line arguments to help diagnose vectorization issues and a new pragma syntax for tuning loop vectorization, interleaving, and unrolling.

New Feature: Diagnostics Remarks

Diagnostic remarks provide the user with insight into the behavior of LLVM’s optimization passes, including unrolling, interleaving, and vectorization. They are enabled using the ‘-Rpass’ family of command line arguments. Interleaving and vectorization diagnostic remarks are produced by specifying the ‘loop-vectorize’ pass. For example, specifying ‘-Rpass=loop-vectorize’ tells us that the following loop was vectorized by a factor of 4 and interleaved by a factor of 2.

void test1(int *List, int Length) {
  int i = 0;
  while(i < Length) {
    List[i] = i*2;
    i++;
  }
}

clang -O3 -Rpass=loop-vectorize -S test1.c -o /dev/null

test1.c:4:5: remark: 
vectorized loop (vectorization factor: 4, unrolling interleave factor: 2)
    while(i < Length) {
    ^

Many loops cannot be vectorized, including loops with complicated control flow, unvectorizable types, and unvectorizable calls. For example, to prove it is safe to vectorize the following loop, we must prove that array ‘A’ is not an alias of array ‘B’. However, the bounds of array ‘A’ cannot be identified.

void test2(int *A, int *B, int Length) {
  for (int i = 0; i < Length; i++)
    A[B[i]]++;
}

clang -O3 -Rpass-analysis=loop-vectorize -S test2.c -o /dev/null

test2.c:3:5: remark:
loop not vectorized: cannot identify array bounds
    for (int i = 0; i < Length; i++)
    ^

Control flow and other unvectorizable statements are reported by the '-Rpass-analysis' command line argument. For example, many uses of ‘break’ and ‘switch’ are not vectorizable.

For example, a loop containing a ‘break’ statement:

for (int i = 0; i < Length; i++) {
  if (A[i] > 10.0)
    break;
  A[i] = 0;
}

control_flow.cpp:5:9: remark:
loop not vectorized: loop control flow is not understood by vectorizer
    if (A[i] > 10.0)
    ^

And a loop containing a ‘switch’ statement:

for (int i = 0; i < Length; i++) {
  switch(A[i]) {
  case 0: B[i] = 1; break;
  case 1: B[i] = 2; break;
  default: B[i] = 3;
  }
}

no_switch.cpp:4:5: remark:
loop not vectorized: loop contains a switch statement
    switch(A[i]) {
    ^

New Feature: Loop Pragma Directive

Explicit control over the behavior of vectorization, interleaving, and unrolling is necessary to fine tune performance. For example, when compiling for size (-Os) it's a good idea to vectorize the hot loops of the application to improve performance. Vectorization, interleaving, and unrolling can be explicitly specified using the #pragma clang loop directive prior to any for, while, do-while, or C++11 range-based for loop. For example, the vectorization width and interleaving count are explicitly specified for the following loop using the loop pragma directive.

void test3(float *Vx, float *Vy, float *Ux, float *Uy, float *P, int Length) {
#pragma clang loop vectorize_width(4) interleave_count(4)
#pragma clang loop unroll(disable)
  for (int i = 0; i < Length; i++) {
    float A = Vx[i] * Ux[i];
    float B = A + Vy[i] * Uy[i];
    P[i] = B;
   }
}

clang -O3 -Rpass=loop-vectorize -S test3.c -o /dev/null

test3.c:5:5: remark:
vectorized loop (vectorization factor: 4, unrolling interleave factor: 4)
    for (int i = 0; i < Length; i++) {
    ^

Integer Constant Expressions

The options vectorize_width, interleave_count, and unroll_count take an integer constant expression, so the value can be computed at compile time as in the example below.

template <int ArchWidth, int ExecutionUnits>
void test4(float *Vx, float *Vy, float *Ux, float *Uy, float *P, int Length) {
#pragma clang loop vectorize_width(ArchWidth)
#pragma clang loop interleave_count(ExecutionUnits * 4)
  for (int i = 0; i < Length; i++) {
    float A = Vx[i] * Ux[i];
    float B = A + Vy[i] * Uy[i];
    P[i] = B;
   }
}

void compute_test4(float *Vx, float *Vy, float *Ux, float *Uy, float *P, int Length) {
  const int arch_width = 4;
  const int exec_units = 2;
  test4<arch_width, exec_units>(Vx, Vy, Ux, Uy, P, Length);
}

clang -O3 -Rpass=loop-vectorize -S test4.cpp -o /dev/null

test4.cpp:6:5: remark:
vectorized loop (vectorization factor: 4, unrolling interleave factor: 8)
    for (int i = 0; i < Length; i++) {
    ^

Performance Warnings

Sometimes the loop transformation is not safe to perform. For example, vectorization can fail due to the use of complex control flow. If vectorization is explicitly specified, a warning message is produced to alert the programmer that the directive cannot be followed. For example, the following function, which returns the index of the last positive value in the list, cannot be vectorized because the ‘last_positive_index’ variable is used outside the loop.

int test5(int *List, int Length) {
  int last_positive_index = 0;
  #pragma clang loop vectorize(enable)
  for (int i = 1; i < Length; i++) {
    if (List[i] > 0) {
      last_positive_index = i;
      continue;
    }
    List[i] = 0;
  }
  return last_positive_index;
}

clang -O3 -g -S test5.c -o /dev/null

test5.c:5:9: warning:
loop not vectorized: failed explicitly specified loop vectorization
    for (int i = 1; i < Length; i++) {
        ^

The debug option ‘-g’ allows the source line to be provided with the warning.

Conclusion

Diagnostic remarks and the loop pragma directive are two new features that are useful for feedback-directed performance tuning. Special thanks to all of the people who contributed to the development of these features. Future work includes adding diagnostic remarks to the SLP vectorizer and an additional option for the loop pragma directive to declare the memory operations as safe to vectorize. Additional ideas for improvements are welcome.

Tuesday, May 28, 2013

LLVM 3.3 Vectorization Improvements

I would like to give a brief update regarding vectorization in LLVM. When LLVM 3.2 was released, it featured a new experimental loop vectorizer that was disabled by default. Since then, we have continued to work hard on improving vectorization, and we have some news to share. First, the loop vectorizer has new features and is now enabled by default at -O3. Second, we have a new SLP vectorizer. And finally, we have new clang command line flags to control the vectorizers.
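
As a rough sketch (flag names taken from clang's documentation rather than quoted from the post), the loop vectorizer can be disabled with -fno-vectorize and the SLP vectorizer enabled with -fslp-vectorize:

clang -O3 -fno-vectorize test.c
clang -O2 -fslp-vectorize test.c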

Friday, December 7, 2012

New Loop Vectorizer

I would like to give a brief update regarding the development of the Loop Vectorizer. LLVM now has two vectorizers: the Loop Vectorizer, which operates on loops, and the Basic Block Vectorizer, which optimizes straight-line code. These vectorizers focus on different optimization opportunities and use different techniques. The BB vectorizer merges multiple scalars that are found in the code into vectors, while the Loop Vectorizer widens instructions in the original loop to operate on multiple consecutive loop iterations.
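
A small C sketch of the two kinds of opportunity (my illustration, not code from the post): the Loop Vectorizer widens the loop body so several consecutive iterations are processed per vector instruction, while the BB vectorizer merges independent scalar statements in straight-line code into vector operations:

/* Loop Vectorizer: widens this loop so that, e.g., four elements of
   A and B are added per vector iteration. */
void add_arrays(float *C, const float *A, const float *B, int N) {
  for (int i = 0; i < N; i++)
    C[i] = A[i] + B[i];
}

/* BB vectorizer: these four independent scalar adds in straight-line
   code can be merged into a single four-wide vector add. */
void add_point(float *C, const float *A, const float *B) {
  C[0] = A[0] + B[0];
  C[1] = A[1] + B[1];
  C[2] = A[2] + B[2];
  C[3] = A[3] + B[3];
}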

Monday, December 19, 2011

LLVM 3.1 vector changes

Intel uses the Low-Level Virtual Machine (LLVM) in a number of products, including the Intel® OpenCL SDK. The SDK's implicit vectorization module generates LLVM-IR (intermediate representation) which uses vector types.

LLVM-IR supports operations that use vector data types, and the LLVM code generator needs to do non-trivial work in order to efficiently compile vector operations into SIMD instructions. Recently, there have been changes to the LLVM code generator that enable better code generation for vector operations. In addition to many low-level optimizations, this post talks about two major changes: the implementation of vector-select and the support for vectors-of-pointers.
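
As a hedged illustration of where a vector select shows up (my example, not code from the post): once the loop below is vectorized, the ternary operator becomes a per-lane select between the two operands rather than a branch, which is the kind of operation the code generator must lower efficiently to SIMD instructions:

/* After vectorization, the ?: becomes a vector select: each lane
   picks either T[i] or 0 based on a lane-wise comparison mask. */
void clamp_negative(int *A, const int *T, int N) {
  for (int i = 0; i < N; i++)
    A[i] = T[i] > 0 ? T[i] : 0;
}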

Sunday, September 18, 2011

Greedy Register Allocation in LLVM 3.0

LLVM has two new register allocators: Basic and Greedy. When LLVM 3.0 is released, the default optimizing register allocator will no longer be linear scan, but the new greedy register allocator.
With its global live range splitting, the greedy algorithm generates code that is 1-2% smaller, and up to 10% faster than code produced by linear scan.
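
To compare allocators directly, the allocator can also be selected by hand when running the code generator; a minimal sketch assuming the llc tool and its -regalloc option:

llc -O2 -regalloc=greedy input.ll -o input.s
llc -O2 -regalloc=basic input.ll -o input.s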

Sunday, May 29, 2011

LLVM @ "The Architecture of Open Source Applications"


LLVM is featured in a chapter of the new book The Architecture of Open Source Applications. This chapter talks about the high-level design of LLVM, and how it differs from other contemporary compilers and JITs out there, why you might want to use it (if you're looking for compiler libraries), a simple example of writing an optimization, how the code is structured, a 10,000 foot view of how the code generator works, and some of the interesting capabilities LLVM has due to its design. If you're curious what this whole LLVM thing is, then this is a great place to start.

This book is available inexpensively in dead tree or PDF form, and the author royalties are donated to charity. The content is also available for free under the Creative Commons license. Share and enjoy,

-Chris Lattner

Saturday, May 21, 2011

What Every C Programmer Should Know About Undefined Behavior #3/3

In Part 1 of the series, we took a look at undefined behavior in C and showed some cases where it allows C to be more performant than "safe" languages. In Part 2, we looked at the surprising bugs this causes and some widely held misconceptions that many programmers have about C. In this article, we look at the challenges that compilers face in providing warnings about these gotchas, and talk about some of the features and tools that LLVM and Clang provide to help get the performance wins, while taking away some of the surprise.

Saturday, May 14, 2011

What Every C Programmer Should Know About Undefined Behavior #2/3

In Part 1 of our series, we discussed what undefined behavior is, and how it allows C and C++ compilers to produce higher performance applications than "safe" languages. This post talks about how "unsafe" C really is, explaining some of the highly surprising effects that undefined behavior can cause. In Part #3, we talk about what friendly compilers can do to mitigate some of the surprise, even if they aren't required to.

I like to call this "Why undefined behavior is often a scary and terrible thing for C programmers". :-)

Friday, May 13, 2011

What Every C Programmer Should Know About Undefined Behavior #1/3

People occasionally ask why LLVM-compiled code sometimes generates SIGTRAP signals when the optimizer is turned on. After digging in, they find that Clang generated a "ud2" instruction (assuming X86 code) - the same as is generated by __builtin_trap(). There are several issues at work here, all centering around undefined behavior in C code and how LLVM handles it.

This blog post (the first in a series of three) tries to explain some of these issues so that you can better understand the tradeoffs and complexities involved, and perhaps learn a few more of the dark sides of C. It turns out that C is not a "high level assembler" like many experienced C programmers (particularly folks with a low-level focus) like to think, and that C++ and Objective-C have directly inherited plenty of issues from it.
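
As a small, hedged illustration (my example, not one from the series): when the optimizer can prove a statement always executes undefined behavior, such as a store through a null pointer, it may replace the surrounding code with a trap instruction, which appears as ud2 on X86:

int always_traps(void) {
  int *p = 0;
  *p = 42;      /* store through a null pointer: undefined behavior        */
  return 0;     /* at -O2, clang may emit a ud2 trap instead of returning  */
}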

Wednesday, April 14, 2010

Extensible Metadata in LLVM IR

A common request by front-end authors is to be able to add some sort of metadata to LLVM IR. This metadata could be used to influence language-specific optimization passes (for example, Type Based Alias Analysis in C), tag information for a custom code generator, or pass through information to link time optimization. LLVM 2.7 provides first-class support for this, and has switched debug information over to use it (improving debug info!).

While the details of this feature can be found in the LLVM Language Reference manual, sometimes it is hard to distill the big picture from the low-level details. This post tries to fill the gap by explaining some history, motivation and example use cases for this new LLVM 2.7 feature.

This post was written by Devang Patel and myself.

Sunday, January 3, 2010

Address of Label and Indirect Branches in LLVM IR

The GCC compiler supports a useful "Labels as Values" extension, which allows code to take the address of a label and then later do an unconditional branch to an address specified as a void*. This extension is particularly useful for building efficient interpreters.

LLVM has long supported this extension by lowering it to a "correct" but extremely inefficient form. New in LLVM 2.7 is IR support for taking the address of a label and jumping to it later, which allows implementing this extension much more efficiently. This post describes this new LLVM IR feature and how it works.
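
For readers who have not seen the extension, a minimal sketch of the direct-threaded dispatch style it enables (standard GNU C computed gotos; my example, not code from the post):

/* Each handler jumps straight to the next opcode's handler through a
   computed goto, rather than returning to a central switch. */
int interpret(const unsigned char *code) {
  static void *dispatch[] = { &&op_inc, &&op_dec, &&op_halt };
  int acc = 0;
  goto *dispatch[*code];

op_inc:  acc++; goto *dispatch[*++code];
op_dec:  acc--; goto *dispatch[*++code];
op_halt: return acc;
}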

Saturday, December 19, 2009

Advanced Topics in Redundant Load Elimination with a Focus on PHI Translation

In our previous post on GVN we introduced some basics of load elimination. This post describes some advanced topics and focuses on PHI translation: what it is, why it is important, some nice things it can do, and how it is implemented in LLVM.
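
As a rough C-level sketch of the kind of case PHI translation handles (an illustration of the idea, not an example taken from the post): the address of the final load is itself a PHI of two different pointers, so the load only becomes visibly redundant once that address is translated back into each predecessor block:

void use_value(int x);

/* The address loaded at the end is a PHI of 'ptr' and 'ptr + 1'.
   Translating it into each predecessor shows the value is already
   available there (4 or 5), so GVN can replace the load with a PHI
   of the two stored values. */
void example(int *ptr, int c) {
  int *p;
  if (c) {
    ptr[0] = 4;
    p = ptr;
  } else {
    ptr[1] = 5;
    p = ptr + 1;
  }
  use_value(*p);
}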

Thursday, December 17, 2009

Introduction to load elimination in the GVN pass

One very important optimization that the GVN pass (opt -gvn) does is load elimination. Load elimination involves several subsystems (including alias analysis, memory dependence analysis, SSA construction, PHI translation) and has many facets (full vs partial redundancy elimination, value coercion, handling memset/memcpy, etc).  In this post, I introduce and motivate the topic, which will let us expand on it in future posts.
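
As a simple illustration of the most basic case (my example, not one from the post): the second load below is fully redundant, and GVN can reuse the first loaded value once alias analysis shows the intervening store cannot clobber it:

/* With 'restrict', alias analysis knows the store to *dst cannot
   modify *src, so GVN deletes the second load and reuses 'a'. */
int sum_twice(int *restrict dst, const int *restrict src) {
  int a = *src;
  *dst = a + 1;
  int b = *src;   /* redundant load: same address, not clobbered */
  return a + b;
}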