FINAL REPORT OF THE
OPTIMIZATION PROJECT
ETA WORKSTATION FOR WEATHER PREDICTION
We report the project's original goals, achievements, and methodology, followed by a short conclusion.
The project goal was to reduce the execution time of the Eta Workstation on weather forecast runs by about 20%. The project was scheduled for July and August of 2004.
The goal was achieved on schedule. Execution time was reduced by 22%: a forecast day requires 12 minutes and 30 seconds of CPU time on the optimized version, against about 16 minutes on the original version. In other words, the original version forecasts 90 days per CPU day, while the new version forecasts 115 days per CPU day.
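These figures follow from simple arithmetic, which the short sketch below reproduces. The function and variable names are illustrative only and are not part of the Eta code.

```python
MINUTES_PER_DAY = 24 * 60  # CPU minutes available in one CPU day


def forecast_days_per_cpu_day(cpu_minutes_per_forecast_day):
    """Forecast days produced per day of CPU time."""
    return MINUTES_PER_DAY / cpu_minutes_per_forecast_day


def relative_reduction(before, after):
    """Fractional reduction when going from before to after."""
    return (before - after) / before


original_minutes = 16.0   # about 16 minutes per forecast day (figure from the report)
optimized_minutes = 12.5  # 12 minutes and 30 seconds per forecast day

print(forecast_days_per_cpu_day(original_minutes))   # 90.0
print(forecast_days_per_cpu_day(optimized_minutes))  # 115.2
print(f"{100 * relative_reduction(original_minutes, optimized_minutes):.0f}%")  # 22%
```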
The optimization process changed the numerical results. The changes are due to specific compiler optimizations (vectorization) that the original code inhibited and the new code enables. The output files are bit-for-bit identical when both codes are compiled without any optimization.
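One common way in which vectorization can alter results is by changing the order in which floating-point operations are evaluated. The sketch below is a general illustration of that effect, not a diagnosis of this particular code: floating-point addition is not associative, so regrouping the same operands can change the last bits of the result.

```python
# Floating-point addition is not associative: regrouping the same three
# operands changes the rounded result. This is a generic double-precision
# illustration; the values are not taken from the Eta model.
left_to_right = (0.1 + 0.2) + 0.3
regrouped = 0.1 + (0.2 + 0.3)

print(left_to_right)               # 0.6000000000000001
print(regrouped)                   # 0.6
print(left_to_right == regrouped)  # False
```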
The proposed work was to vectorize procedures hzadv and gsmcolumn.
Two parts of hzadv inhibited vectorization: the include file checkin and the temperature advection loop.
The include file guards against model “explosion” by verifying that major fields are within bounds. If they are not, the field values are printed and the computation is halted. The bounds check and the action taken on an out-of-bounds value were performed in the same loop, and the write and stop statements of the out-of-bounds case prevented vectorization. The out-of-bounds handling was moved out of the loop, allowing vectorization to proceed.
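The restructuring of the bounds check can be sketched as follows. This is a minimal Python illustration of the control flow only: the contents of checkin are not shown in this report, so the function names, arguments, and error message are hypothetical, and the vectorization itself is performed by the Fortran compiler on the real code.

```python
def check_in_loop(field, lower, upper):
    """Original structure: the check and the abort share one loop."""
    for index, value in enumerate(field):
        if not (lower <= value <= upper):
            # Reporting and stopping inside the loop body is the pattern
            # that prevented vectorization.
            raise RuntimeError(f"field out of bounds at {index}: {value}")


def check_hoisted(field, lower, upper):
    """Restructured: count violations in the loop, act on them afterwards."""
    out_of_bounds = 0
    for value in field:
        # Only arithmetic on each element: no output, no early exit.
        out_of_bounds += int(not (lower <= value <= upper))
    if out_of_bounds:
        # Rare slow path, taken outside the hot loop, that reports and stops.
        check_in_loop(field, lower, upper)
```

In the common case where every value is in bounds, the second version runs one simple counting loop and never reaches the reporting code.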
The temperature advection loop visits the points within a vertical level. An outer loop varies the vertical level.
The innermost loop, in the “upstream” method, encompasses two cases:
1. All required points are outside mountains;
2. At least one required point is inside mountains.
The procedure that handles the second case generates innermost-loop dependencies that cannot be removed; the first case generates none. Because both cases share the same innermost loop, both execute in scalar mode.
A new innermost loop was added, to be executed when the vertical level has no mountains. A test outside the innermost loop decides which loop to take. Whenever there are mountains, the old loop runs in scalar mode; otherwise, the new loop runs in vector mode.
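The two-loop arrangement can be sketched as below. The update formulas are hypothetical stand-ins chosen only to show a loop with a carried dependency next to one without; they are not the upstream scheme used in hzadv, and all names are illustrative.

```python
def advect_general(temp, mountain):
    """Old loop: handles both cases point by point (scalar mode)."""
    out = list(temp)
    for i in range(1, len(temp)):
        if mountain[i] or mountain[i - 1]:
            # Stand-in for the mountain handling: it reuses the value just
            # computed at i - 1, which is a loop-carried dependency.
            out[i] = out[i - 1]
        else:
            out[i] = 0.5 * (temp[i] + temp[i - 1])
    return out


def advect_no_mountains(temp):
    """New loop: every iteration is independent, so it can vectorize."""
    out = list(temp)
    for i in range(1, len(temp)):
        out[i] = 0.5 * (temp[i] + temp[i - 1])
    return out


def advect_level(temp, mountain):
    """Decide once per vertical level, outside the innermost loop."""
    if any(mountain):
        return advect_general(temp, mountain)
    return advect_no_mountains(temp)
```

Both loops give the same answer on a level without mountains, so the test only selects the faster path and does not change the result.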
The program information below shows that the modifications to hzadv reduced CPU (user) time from the original 958 seconds to 870 seconds:
| Metric | Original | hzadv |
|---|---|---|
| Real Time (sec) | 1005.85 | 894.39 |
| User Time (sec) | 958.92 | 870.82 |
| Sys Time (sec) | 6.63 | 3.85 |
| Vector Time (sec) | 515.95 | 537.61 |
| Inst. Count | 208628185271 | 174652651576 |
| V. Inst. Count | 23395831626 | 24834071241 |
| V. Element Count | 2543449664895 | 2598220833948 |
| FLOP Count | 922818382440 | 923127490314 |
| MOPS | 2845.57 | 3155.71 |
| MFLOPS | 962.35 | 1060.07 |
| VLEN | 108.71 | 104.62 |
| V. Op. Ratio (%) | 93.21 | 94.55 |
| Memory Size (MB) | 1296.03 | 1296.03 |
| MIPS | 217.56 | 200.56 |
| I-Cache (sec) | 26.47 | 27.26 |
| O-Cache (sec) | 66.01 | 34.93 |
| Bank (sec) | 22.95 | 1.21 |
Besides the reduction in user time, observe that vector time increased from 515 to 537 seconds.
Program output was unchanged: forecasted fields are bit-for-bit identical in the original and the optimized versions.
Gsmcolumn is a procedure that handles one vertical (k) column. It is invoked solely by gsmdriver, within a loop that sweeps all vertical columns of the domain (i,j). Both procedures execute mainly in scalar mode, because of dependencies in the vertical (in gsmcolumn) and the procedure invocation within a loop (in gsmdriver). The proposed solution was to pass entire fields to gsmcolumn, allowing vectorization across multiple columns. To do so, the loop in gsmdriver has to be split into three parts: invocation preparation (one loop), the invocation itself (passing entire fields), and post-processing (one loop).
The first step was to modify gsmdriver, passing entire fields to gsmcolumn and splitting the loop into the three parts described above. The corresponding modification to gsmcolumn was to enclose the computation in a double loop over i and j.
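The restructuring of the driver can be sketched as follows. This is a minimal Python illustration of the control flow only: the real routines are not reproduced in this report, so the function names, the doubling used as "preparation", and the running sum used as the column computation are all hypothetical stand-ins.

```python
def column_physics(column):
    """Stand-in for the single-column routine (hypothetical computation)."""
    running = 0.0
    out = []
    for value in column:          # vertical (k) loop, carries a dependency
        running += value
        out.append(running)
    return out


def driver_original(columns):
    """Original shape: prepare, invoke, and post-process one column at a time."""
    results = []
    for column in columns:        # stands for the loop over (i, j)
        scaled = [2.0 * v for v in column]
        results.append(column_physics(scaled)[-1])
    return results


def field_physics(columns):
    """Whole-field routine: the same computation enclosed in a column loop."""
    out = []
    for column in columns:        # stands for the double loop over i and j
        running = 0.0
        levels = []
        for value in column:
            running += value
            levels.append(running)
        out.append(levels)
    return out


def driver_split(columns):
    """Restructured driver: three separate parts."""
    prepared = [[2.0 * v for v in column] for column in columns]  # 1. preparation
    computed = field_physics(prepared)                            # 2. one invocation
    return [levels[-1] for levels in computed]                    # 3. post-processing
```

At this stage the work done is unchanged; the point of the split is that the procedure call no longer sits inside the loop over columns.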
The second step was to promote the scalars within the gsmcolumn double loop to arrays indexed by i and j. Even then, the compiler could not vectorize the loop, because of the complex loop body (too many if statements) and the procedure invocations inside the loop.
The third step was to expand the invoked procedures inline, by introducing new compiler switches in the makefile. The compiler was not able to expand procedures deposit and condense; the remaining procedures were expanded.
The fourth step was to change the loop order from i,j,k to k,j,i, to ease the compiler's job (since there are loop-carried dependencies in the k direction and full independence in the i direction). Even then, the compiler could not vectorize, for the same reasons: complex loop structure and procedure invocations.
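Scalar promotion and the change of loop order can be sketched together, since the second depends on the first. The running sum below is a hypothetical stand-in for a computation with a dependency along the vertical; a single index c stands for the (i, j) pair, and none of the names come from gsmcolumn.

```python
def accumulate_column_order(field):
    """Columns outer, levels inner (order i, j, k) with a scalar accumulator."""
    n_cols, n_levels = len(field), len(field[0])
    out = [[0.0] * n_levels for _ in range(n_cols)]
    for c in range(n_cols):
        running = 0.0                  # scalar, private to one column
        for k in range(n_levels):      # innermost loop carries the dependency
            running += field[c][k]
            out[c][k] = running
    return out


def accumulate_level_order(field):
    """Levels outer, columns inner (order k, j, i) with the scalar promoted."""
    n_cols, n_levels = len(field), len(field[0])
    out = [[0.0] * n_levels for _ in range(n_cols)]
    running = [0.0] * n_cols           # promoted: one accumulator per column
    for k in range(n_levels):          # dependency now carried by the outer loop
        for c in range(n_cols):        # innermost iterations are independent
            running[c] += field[c][k]
            out[c][k] = running[c]
    return out
```

Each column still sees the same operations in the same order, so the two versions return identical values; the price of the promotion is the extra storage for one accumulator per column.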
Up to this point, program output was unchanged: the fields produced by the optimized version were identical to those produced by the original version.
For the next six steps, the inner loop was partitioned into a set of simpler loops, attempting to reduce loop complexity and to isolate procedure invocations. After these steps, all loops were vectorized, except for two loops that solely invoke procedures deposit and condense.
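Partitioning a loop in this way is usually called loop fission. The sketch below shows the idea on a small example, assuming a hypothetical procedure adjust that stands in for a call such as deposit or condense; the arithmetic around it is invented for illustration.

```python
def adjust(value):
    """Hypothetical stand-in for a procedure invoked from the loop body."""
    return max(value, 0.0)


def fused(a, b):
    """One complex loop: arithmetic and a procedure call mixed together."""
    out = []
    for x, y in zip(a, b):
        s = x + y
        s = adjust(s)              # the call sits in the middle of the body
        out.append(2.0 * s)
    return out


def partitioned(a, b):
    """The same work as three simple loops, with the call isolated."""
    sums = [x + y for x, y in zip(a, b)]     # loop 1: arithmetic only
    adjusted = [adjust(s) for s in sums]     # loop 2: nothing but the call
    return [2.0 * s for s in adjusted]       # loop 3: arithmetic only
```

The intermediate lists play the role of the temporary arrays that carry values from one loop to the next.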
In the next two steps, both procedures were changed to take entire fields instead of a single column. Both procedures were then vectorized.
Consequently, after twelve steps of modifications, the execution time was reduced to 747 seconds (CPU), as shown by the program information table below:
| Metric | Original | hzadv | gsmcolumn |
|---|---|---|---|
| Real Time (sec) | 1005.85 | 894.39 | 768.80 |
| User Time (sec) | 958.92 | 870.82 | 747.72 |
| Sys Time (sec) | 6.63 | 3.85 | 3.85 |
| Vector Time (sec) | 515.95 | 537.61 | 573.09 |
| Inst. Count | 208628185271 | 174652651576 | 126870532376 |
| V. Inst. Count | 23395831626 | 24834071241 | 26021411305 |
| V. Element Count | 2543449664895 | 2598220833948 | 2720659527709 |
| FLOP Count | 922818382440 | 923127490314 | 939221590299 |
| MOPS | 2845.57 | 3155.71 | 3773.47 |
| MFLOPS | 962.35 | 1060.07 | 1256.11 |
| VLEN | 108.71 | 104.62 | 104.55 |
| V. Op. Ratio (%) | 93.21 | 94.55 | 96.43 |
| Memory Size (MB) | 1296.03 | 1296.03 | 1376.03 |
| MIPS | 217.56 | 200.56 | 169.68 |
| I-Cache (sec) | 26.47 | 27.26 | 14.57 |
| O-Cache (sec) | 66.01 | 34.93 | 27.66 |
| Bank (sec) | 22.95 | 1.21 | 1.60 |
The entire procedure gsmcolumn was vectorized, as were the companion procedures gsmdriver, deposit and condense. Vectorization reduced total time while increasing vector time at a smaller rate. Observe the large reduction in instruction count, since many scalar instructions were traded for fewer vector instructions. Observe also the effect on memory size of promoting scalars to arrays: it increased from 1296 MB to 1376 MB.
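The relative size of these changes can be read off the Original and gsmcolumn columns of the table. The short sketch below does the arithmetic; the helper name is illustrative.

```python
def percent_change(before, after):
    """Signed percentage change when going from before to after."""
    return 100.0 * (after - before) / before


# Values copied from the Original and gsmcolumn columns of the table.
user_time = (958.92, 747.72)                 # seconds
vector_time = (515.95, 573.09)               # seconds
inst_count = (208628185271, 126870532376)    # instructions
memory_mb = (1296.03, 1376.03)               # megabytes

print(f"user time:   {percent_change(*user_time):+.1f}%")    # -22.0%
print(f"vector time: {percent_change(*vector_time):+.1f}%")  # +11.1%
print(f"inst. count: {percent_change(*inst_count):+.1f}%")   # -39.2%
print(f"memory size: {percent_change(*memory_mb):+.1f}%")    # +6.2%
```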
It remains to examine the changes that the optimization process introduced in the forecasted fields.
First, we have to show that no error was introduced during the optimization process.
To do that, we compiled procedure gsmcolumn without any optimization (compiler switch -C debug) in both the original version and the optimized version. Both versions were executed, and the forecasted fields were compared. They were bit-for-bit identical.
Consequently, the code modifications did not introduce errors, and the changes in the output fields were solely due to compiler optimizations.
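A binary comparison of this kind can be done with any byte-level file comparison. The report does not say which tool was used; the sketch below is one possible way, using Python's standard filecmp module, and the function name is illustrative.

```python
import filecmp


def outputs_identical(path_a, path_b):
    """Return True when the two files match byte for byte."""
    # shallow=False forces a comparison of the file contents rather than
    # relying only on file size and modification time.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

It would be called with the paths of the two output files, one from each version of the model.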
Second, we have to see how much the optimization process changed the output fields.
Picture 1 shows surface pressure for 24 hours forecasted by the original code and the optimized code. Picture 2 shows the absolute value of the difference of these fields.
We observe that the differences are scattered over the domain, with a maximum of 0.2% of the original value, which is typical of round-off.
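A maximum relative difference of this kind can be computed as sketched below. The function name is illustrative and the sample pressures are made up for the example; they are not values from the forecast.

```python
def max_relative_difference(reference, candidate):
    """Largest |candidate - reference| / |reference| over all grid points."""
    # Assumes no reference value is zero, which holds for surface pressure.
    return max(abs(c - r) / abs(r) for r, c in zip(reference, candidate))


# Made-up surface pressures in hPa, for illustration only.
original_field = [1000.0, 1010.0, 995.0]
optimized_field = [1000.0, 1012.0, 995.0]

print(f"{100 * max_relative_difference(original_field, optimized_field):.2f}%")  # 0.20%
```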
It is appropriate to compare the differences introduced by the optimizations with the differences introduced by the regular compilation of the original code. Observe that the original code compilation allows optimizations on gsmcolumn, since the original makefile uses the default optimization switch. Consequently, there may be differences in the forecasted fields if the default optimizations are suppressed (by selecting compiler switch -C debug).
Picture 3 shows the absolute difference between the surface pressure fields of the original code compiled with default switches and compiled without any optimization on gsmcolumn. We observe that the order of magnitude and the position of the differences are quite similar to those in Picture 2.
In other words, the optimization process introduced output changes that have the same order of magnitude and distribution as the changes introduced by the default optimization.
Picture 1 - Surface pressure forecasted by the original code (shaded) and by the optimized code (contour)
Picture 2 - Absolute difference of the Picture 1 fields
Picture 3 - Absolute difference of surface pressure on the original code, compiled with and without optimization on procedure gsmcolumn.
Project goals were achieved on schedule.
The optimization process reduced execution time by about 22%. It also introduced differences in the output fields, due to vectorization. These differences have the same order of magnitude as those introduced by the compiler's code optimization phase on the original code.