qlist

You are free to download one of my Perl scripts, qlist, but always remember:

What comes for free has no guarantee.

qlist estimates the efficiency and performance (MFLOP/s) of loops
written in C or FORTRAN and compiled with IBM's compilers
(xlf for FORTRAN, xlc for C) using the -qlist option. The estimate assumes no cache misses.
qlist takes its information from the .lst file that the compiler generates with -qlist.

How to use qlist on an IBM RS6000 with a 120 MHz clock:
1. Identify the subroutine/function you want to look at.
2. Put the source code in a separate file (not necessary, but recommended).
3. Compile the code with -qlist (e.g. xlf -c -qlist a32tx.f).
4. Run qlist with the file name (without extension) and the clock rate in MHz: qlist a32tx 120

Example output:
%qlist a32tx 120

Loop-summary of instructions, cycles and flops in a32tx.lst

Loop CL.32 in object a32tix 
           no.    cyc.   cyc.
Instr.     of     spent  lost  flops
fma        39     13     -6       78
fnms       15      6     -1       30
fa          7      3      2        7
fs          4      1      0        4
fm          1      0      0        1
lfq         8      5      5        0
lfqu        2      1      1        0
lfd        10      2      2        0
lfdu        2      1      1        0
lfdx        7      2      2        0
lm          1      4      4        0
stfq        1      0      0        0
stfd        6      2      2        0
stfdu       1      0      0        0
cal         8      3      3        0
cax         1      0      0        0
rlinm       2      1      1        0
bl          1      0      0        0
bc          1      0      0        0
b           1      0      0        0
mtspr       2      1      1        0
mfspr       1      1      1        0
neg         1      0      0        0
------   ----  -----  -----    -----  MFLOP/s  %-eff.  Ld's  St's
Total     122     46     18      120   313.04  65.2    30     8
%
The first column lists the assembly instructions found in the loop.
The second column is how many times each instruction is executed during one iteration.
The third column is how many cycles this takes, assuming no cache misses.
The fourth column is how many cycles are lost on this instruction, assuming that each cycle should produce 4 FLOPs (2 multiply-adds or the like per cycle).
The fifth column is how many FLOPs the instruction produces during one iteration.
The last line sums these numbers and gives the MFLOP/s rate (assuming no cache misses), the percentage this is of the theoretical peak performance of the computer (if no clock rate in MHz is given, 135 MHz is assumed), and how many loads and stores are executed during one iteration.
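The figures on the Total line can be checked by hand. The following is a minimal Python sketch of that arithmetic, not qlist's own code (qlist is a Perl script); the function name loop_summary and its parameter names are made up for illustration. It assumes a peak of 4 FLOPs per cycle, as stated above.

```python
def loop_summary(flops, cycles, clock_mhz=135.0, peak_flops_per_cycle=4):
    """Return (MFLOP/s, % of peak) for one loop iteration, ignoring cache misses.

    Hypothetical helper: the 135 MHz default mirrors the default described
    in the text; the peak of 4 FLOPs per cycle is the assumption stated there.
    """
    mflops = flops / cycles * clock_mhz          # FLOPs per cycle times cycles per microsecond
    peak = peak_flops_per_cycle * clock_mhz      # theoretical peak in MFLOP/s
    return mflops, 100.0 * mflops / peak


# Totals from the example output: 120 FLOPs in 46 cycles at 120 MHz.
mflops, eff = loop_summary(flops=120, cycles=46, clock_mhz=120)
print(f"{mflops:.2f} MFLOP/s, {eff:.1f} % of peak")  # 313.04 MFLOP/s, 65.2 % of peak
```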
qlist currently recognizes the following floating-point operations:
fma, fnma, fms, fnms, fa, fs, fm and fd.
If your code produces other floating-point instructions, they are not counted in the performance figures generated by qlist.
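A rough Python sketch of how such a FLOP count could work is given here; it is an illustration, not qlist's actual implementation. The per-instruction values for fma, fnms, fa, fs and fm are read off the example table (FLOPs divided by instruction count); the values for fnma, fms and fd are assumptions made by analogy. The names FLOPS_PER_INSTR and count_flops are hypothetical.

```python
# FLOPs credited per executed instruction.
# fma, fnms, fa, fs, fm: derived from the example output above.
# fnma, fms, fd: assumed by analogy, not confirmed by the example.
FLOPS_PER_INSTR = {
    "fma": 2, "fnma": 2, "fms": 2, "fnms": 2,
    "fa": 1, "fs": 1, "fm": 1, "fd": 1,
}


def count_flops(instr_counts):
    """Sum the FLOPs of one iteration; unrecognized instructions count as 0."""
    return sum(FLOPS_PER_INSTR.get(name, 0) * n for name, n in instr_counts.items())


# Part of the example loop: floating-point instructions plus two others.
example = {"fma": 39, "fnms": 15, "fa": 7, "fs": 4, "fm": 1, "lfq": 8, "cal": 8}
print(count_flops(example))  # 120
```

Instructions missing from the table, such as lfq and cal here, contribute nothing, which matches the behavior described above for unrecognized floating-point instructions.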