Test what kind of loops the Intel compiler can vectorize.
To test vectorization, we are going to write our own function.
Let's call it MyFunction,
the name really doesn't matter.
Pretend that it accepts a single input, the integer n. With this input, we will declare two arrays, A of size n and B of size n. I will initialize these arrays using array notation, a language extension supported by the Intel C++ compiler: A from element zero, n elements, equals B from element zero, n elements, equals one. And I will return A of two. I have to return something to make sure that this entire code is not eliminated by the compiler through dead code elimination.
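As a rough sketch, assuming the arrays hold double precision numbers and the file is called worker.cc (the exact types and names are my assumption, not visible in the recording), the function might look like this:

    double MyFunction(const int n) {
      double A[n], B[n];    // variable-length arrays, a compiler extension in C++
      B[0:n] = 1.0;         // Cilk Plus array notation: set all n elements of B to 1
      A[0:n] = B[0:n];      // copy B into A in one array-notation statement
      return A[2];          // return something so dead code elimination keeps the work
    }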
To compile this code, I will execute a shell command invoking the Intel C++ compiler. Instead of compiling into an object file, I will produce assembly with -S. I will request an optimization report with -qopt-report, and then I indicate the source file name.
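Assuming the Intel C++ compiler icpc and a source file named worker.cc, the command might look like this:

    icpc -S -qopt-report worker.cc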
The command has succeeded, and it says that optimization reports are generated in .optrpt files in the output location.
Let's see what new files we have.
So in addition to the code, we now have the optimization report and the assembly listing in the .s file.
The optimization report is just a text file, and what you find here is that the loop in line three was vectorized. Furthermore, there was a remainder loop which was not vectorized; the remainder loop is necessary if the length of the arrays, the value of n, is not a multiple of the vector length.
The assembly in worker.s will tell us what exactly is happening.
So for line three,
this is the beginning of the loop,
testing the loop exit condition and setting up. And here is the operation: mov is a vector memory instruction; in this case it acts as a store, and it uses the xmm registers, xmm0 specifically, to store data into memory. The suffix ps stands for packed single, so this is a packed vector of single precision numbers.
Well, how long is this vector?
Because this is an xmm register,
we know that the length of the vector is 128 bits.
These are legacy SSE instructions.
If we wanted to compile the code for Intel Xeon Phi, I would have to include one more compiler argument, -xMIC-AVX512.
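Assuming the same icpc invocation as before, the command might become:

    icpc -S -qopt-report -xMIC-AVX512 worker.cc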
With this command, the optimization report still says that the loop is vectorized; in fact, the remainder loop is now vectorized as well, and the assembly shows that we are using ZMM registers.
So this is a vector store operation, this time for a vector of packed double precision numbers, and ZMM registers are 512 bits long. These are AVX-512 instructions, the instruction set introduced with second-generation Intel Xeon Phi processors.
Okay. So, this was a trivial loop,
now let's set up a loop that does something more interesting.
So first, A plus B. Let's see how that vectorizes; that shouldn't be a problem at all.
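A minimal sketch of this loop, reusing the arrays from before:

    for (int i = 0; i < n; i++)
      A[i] = A[i] + B[i];   // element-wise addition: a textbook vectorizable loop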
Indeed, the compiler report says that we are vectorized, and the assembly for line six is using vector loads and vector stores. This is the vector addition operation, vector add packed double, operating on ZMM registers. From the size of the registers and from the data type, we can conclude that we are adding eight double precision numbers at a time.
Well, how about something less trivial? For example, here I will redefine B of i: I will multiply it by the double representation of the index i. Will that vectorize?
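Assuming the modification looks roughly like this (my reconstruction of the loop body, not shown verbatim in the recording):

    for (int i = 0; i < n; i++) {
      B[i] = B[i] * (double)i;   // integer index converted to double, then multiplied in
      A[i] = A[i] + B[i];        // multiply followed by add: a candidate for fusion
    }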
Well, we're running the compilation command. The optimization report says it's vectorized, and the assembly listing should also show that, but this time we are going to see a vector broadcast or a vector conversion. This is a conversion from a vector of integers to a vector of packed double precision numbers, and it goes from a YMM register, which is 256 bits long, to a ZMM register.
Why is it YMM?
Well, that's because our integers are 32-bit integers, and eight of them only fill 256 bits. After another type conversion, take a look at this, this is very interesting: a fused multiply-add operation on double precision numbers.
So what's happening here is that instead of doing the multiplication and the addition separately, the compiler recognized that we have a more complex pattern, a multiplication followed by an addition. To compute A of i, the compiler performed the multiplication and the addition as a single operation.
Okay. So, how about a more complex pattern of vectorization?
Maybe we can introduce a stride: for example, if we go in i from zero to n divided by two and multiply the index by two on the right-hand side, will this vectorize?
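A sketch of this strided loop, under the same assumptions as before:

    for (int i = 0; i < n / 2; i++)
      A[i] = A[i] + B[2 * i];   // stride-2 read of B: a non-unit stride access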
So I'm recompiling. The optimization report says that the loop was vectorized, but in a minute I will show you how to detect a red flag for this vectorization.
In the assembly for line six, as you can see, there are now some new instructions: a vector permutation followed by a vector addition.
So this is still processed with vector instructions. But if we compile the code with a more verbose optimization report, by setting -qopt-report=5, we will find that the loop in line five is vectorized, but there is a remark about a non-unit stride load of variable B in line six. These non-unit stride loads, or non-unit stride stores, are usually less efficient in terms of performance than unit stride accesses.
So, we will talk later about tuning data containers
and loop patterns in such a way that you avoid strided access and maintain unit stride.
Something else that I wanted to demonstrate is branches inside loops.
So: if i modulo three is zero. Only for every third value of i do I want to do this computation. Will that vectorize?
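A sketch of the loop with the branch inside, again assuming the same arrays:

    for (int i = 0; i < n; i++)
      if (i % 3 == 0)           // branch: the body runs only for every third index
        A[i] = A[i] + B[i];     // vectorizable with a mask: all lanes computed, few stored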
Now let's see: recompiling, re-reading the vectorization report, and for line five we find that the loop is vectorized. There are only unit stride loads here, masked aligned unit stride loads, and this masking indicates that not all results of the calculation are stored, only some of them.
Well, how would you compute this with a vector?
What the compiler usually does is lump together eight values of i.
It performs the addition operation for all values of
i and then separately it computes a mask,
and using this mask only some of the results are stored.
So we are going to do three times the amount of arithmetic that the scalar loop would do. But this is okay, because we get a speedup by a factor of eight due to vectorization.
At some point, if the branches are not taken very often, you should think about prohibiting vectorization, because with vectorization you might have to do a lot of arithmetic that you then discard. To prohibit vectorization, you can use #pragma novector, and it might actually result in a performance increase if you take these branches very rarely.
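For instance, placed right before the loop above (the loop body is my reconstruction):

    #pragma novector             // Intel compiler pragma: keep the next loop scalar
    for (int i = 0; i < n; i++)
      if (i % 3 == 0)
        A[i] = A[i] + B[i];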
Finally, let me demonstrate calling functions from vector loops.
If we include the header for the standard math library, I will be able to do things like the exponential of B of i.
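Assuming the standard <cmath> header and the same arrays, the loop might look like:

    #include <cmath>
    // ...
    for (int i = 0; i < n; i++)
      A[i] = std::exp(B[i]);   // transcendental function call inside a vectorizable loop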
Let's see how that works. Recompiling: the loop is hopefully still vectorized. Yes, it looks like it is. In the assembly, let's see what we find for line eight.
This is a vector load, and this is a call into SVML, the Short Vector Math Library, for the exp function with a vector width of eight. SVML is the library that connects the transcendental functions in your code to the underlying architecture. These functions can be used with vector inputs, and this is what happened here.
Sometimes these functions will use analytical approximations; sometimes they will use built-in instructions from your instruction set. It depends on the accuracy that your code is required to produce, and it also depends on the nature of the function. For example, if here I use one over the square root, then instead of a Short Vector Math Library function, we will see an instruction from the AVX-512ER instruction set for the reciprocal square root.
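A sketch of that variant (again my reconstruction of the loop body):

    for (int i = 0; i < n; i++)
      A[i] = 1.0 / std::sqrt(B[i]);   // may map to an AVX-512ER reciprocal square root instruction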
Recompile, check the assembly, and for line eight we have this vector reciprocal square root instruction, where 28 refers to the 28 bits of precision, operating on packed double precision numbers.
Now let's see what is not going to vectorize.
Back to A plus B.
Let's do a harmless-looking change: A of i was equal to A of i plus B of i; to make it more fun, let's replace A of i on the right-hand side with A of i minus one.
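In code, the kernel would look roughly like this (my reconstruction):

    for (int i = 1; i < n; i++)
      A[i] = A[i - 1] + B[i];   // loop-carried dependence: A[i] needs A[i-1] first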
This is actually the kernel of a Fibonacci-style calculation, and this kernel says that I have to know A of i minus one before I can compute A of i. So let's see how this compiles.
When I compile, I will see that the loop in line seven is not vectorized because of a vector dependence. The dependence is between A of i and A of i minus one, and if I look in the assembly, then for line eight I will find scalar instructions.
And this is a scalar addition. It actually has the prefix v, because it is going to use the vector processing units, but the addition is on a single double precision number. So it is not possible to vectorize this loop, because it has a vector dependence; the compiler recognizes it and produces code that will give you the correct result.
Now, the final step, and this one I will allow you to study at home: I will change minus one to plus one, recompile, check the optimization report, and for the loop in line seven we find that the loop is vectorized.
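The modified kernel, again as a sketch:

    for (int i = 0; i < n - 1; i++)
      A[i] = A[i + 1] + B[i];   // reads run ahead of the writes; the compiler vectorizes this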
So, here's a challenge for you.
If I have minus one, it does not vectorize, because it's unsafe. If it has plus one, it does vectorize. So at home, think about it: is it safe to vectorize this kernel? Work out what it looks like with vectors, and see if this compiler's decision is justified.