I'm dusting off an old training paper I wrote some years back, still applicable and a bit of a change in pace. Enjoy.
Brief
Profiling is a mechanism for
identifying program 'hotspots'; areas where your program spends a
good portion of it's time. These regions are prime candidates for
optimizations, where concentration on efficiency will give you the
most bang for your buck. Gprof is a GNU profiling tool available for
a host of platforms and tightly integrated with the ever-so-popular
Gcc compiler. This paper will walk through some basic information to
utilize Gprof to identify program hotspots, optimize these regions
and with-luck note the performance gains.
Introduction
Delivery
of high-performance applications requires a critical step of
evaluating the performance of the application. While care in design
and implementation with concentration on optimization is essential in
delivering high-performance applications profiling the source after
implementation may identify regions where code optimization may show
the greatest benefits. This paper will demonstrate the procedure for
profiling an application, identify program 'hotspots' and optimize
these hotspots with intent to show performance gains. The example
application is an example produced for an independent course with no
artificially introduced optimization candidates. With luck, at
papers end the application will be optimized and demonstrate a
performance gain.
Discussion
Profiling
with Gprof consists of 3 phases: 1) compiling your application with
profiling enabled, 2) executing the application to gather profiling
metrics, and 3) evaluation of the collected metrics.
Compiling
Enabling
profiling requires compiling your application with Gcc much as you'd
normally do. The difference however is the introduction of two flags
that you may not normally provide (-pg and -g). The
-pg option enables profiling, the -g option introduces
debugging symbols. The -g option is not too uncommon, but has
particular significance when profiling as it provides the required
debugging symbols for line-by-line profiling. The absence of the -g
flag will result in the profiler gathering and reporting measures on
a per function basis. Since we are interested in profiling on a
line-by-line basis, we will provide both flags; as follows:
$
gcc -std=c99 -g -pg demo.c -ltheora -lm -o demo
Execution
Execution
of your application takes the same form as your normally run it. You
should specify the same command line arguments, your inputs and
outputs are identical as that while running without profiling
enabled. The only possibly observed difference is that your program
may run slower as it acquires and tallies your profiling measures.
It is worth noting that your application must terminate in a normal
fashion, to include returning from the main() function, or via
calling exit(), abrupt termination in a non-typical fashion
will interrupt profiling and result in no profiling information
collected.
The
normal termination of your program will result in the profiler
generating an output file (gmon.out) just prior to exiting.
The location of this output file is generally in the current working
directory, however if your application makes a habit of changing
directories you may file this output file in the present working
directory of your application prior to terminating. The absence of
this output file hints at the lack of providing the -pg option
during compilation, or the possibility of your application
terminating in a non-normal manner.
$
./demo
Evaluation
Now
that your application has been compiled with profiling enabled,
executed your application in a nominal fashion, and profiling metrics
have been collected. The final step in profiling is the evaluation
of the collected metrics. This is explicitly where the gprof
utility is used. The default report style is done simply by
executing gprof while specifying the executable as follows:
$
gprof demo
This
command will evaluate the profiler metrics and generate a default
profiler report. We have yet to see the example application source,
as it becomes relevant it is listed as follows:
1
// example used for Theora quick survey paper
2
#include <stdio.h>
3
#include <stdlib.h>
4
#include <ogg/ogg.h>
5
#include <theora/theora.h>
6
#include <string.h>
7
#include <math.h>
8
#include <assert.h>
9
10
// this method encodes the specified frame and writes to the ogg
file
11
void writeFrame( FILE *fp, theora_state *theoraState,
ogg_stream_state
*ogg_os,
12
unsigned w, unsigned h, unsigned char *yuv) {
13
// write the frame
14
yuv_buffer yuv_buf;
15
ogg_packet op;
16
ogg_page og;
17
18
unsigned long yuv_w;
19
unsigned long yuv_h;
20
21
unsigned char *yuv_y;
22
unsigned char *yuv_u;
23
unsigned char *yuv_v;
24
25
unsigned int x;
26
unsigned int y;
27
28
/* Must hold: yuv_w >= w */
29
yuv_w = (w + 15) & ~15;
30
31
/* Must hold: yuv_h >= h */
32
yuv_h = (h + 15) & ~15;
33
34
yuv_y = (unsigned char*)malloc(yuv_w * yuv_h);
35
yuv_u = (unsigned char*)malloc(yuv_w * yuv_h / 4);
36
yuv_v = (unsigned char*)malloc(yuv_w * yuv_h / 4);
37
yuv_buf.y_width = yuv_w;
38
yuv_buf.y_height = yuv_h;
39
yuv_buf.y_stride = yuv_w;
40
yuv_buf.uv_width = yuv_w >> 1;
41
yuv_buf.uv_height = yuv_h >> 1;
42
yuv_buf.uv_stride = yuv_w >> 1;
43
yuv_buf.y = yuv_y;
44
yuv_buf.u = yuv_u;
45
yuv_buf.v = yuv_v;
46
47
for(y = 0; y < yuv_h; y++) {
48
for(x = 0; x < yuv_w; x++) {
49
yuv_y[x + y * yuv_w] = 0;
50
}
51
}
52
for(y = 0; y < yuv_h; y += 2) {
53
for(x = 0; x < yuv_w; x += 2) {
54
yuv_u[(x >> 1) + (y >> 1) * (yuv_w >> 1)] =
0;
55
yuv_v[(x >> 1) + (y >> 1) * (yuv_w >> 1)] =
0;
56
}
57
}
58
59
for(y = 0; y < h; y++) {
60
for(x = 0; x < w; x++) {
61
yuv_y[x + y * yuv_w] = yuv[3 * (x + y * w) + 0];
62
}
63
}
64
65
for(y = 0; y < h; y += 2) {
66
for(x = 0; x < w; x += 2) {
67
yuv_u[(x >> 1) + (y >> 1) * (yuv_w >> 1)] =
68
yuv[3 * (x + y * w) + 1];
69
yuv_v[(x >> 1) + (y >> 1) * (yuv_w >> 1)] =
70
yuv[3 * (x + y * w) + 2];
71
}
72
}
73
74
assert(0 == theora_encode_YUVin(theoraState, &yuv_buf));
75
assert(0 != theora_encode_packetout(theoraState, 0, &op));
76
77
ogg_stream_packetin(ogg_os, &op);
78
if(ogg_stream_pageout(ogg_os, &og)) {
79
fwrite(og.header, og.header_len, 1, fp);
80
fwrite(og.body, og.body_len, 1, fp);
81
}
82
83
free(yuv_y);
84
free(yuv_u);
85
free(yuv_v);
86
}
87
88
int main() {
89
// frame dimensions
90
const unsigned Width = 200;
91
const unsigned Height = 200;
92
// frames per second
93
const unsigned Fps = 10;
94
// video quality 0-63 (63-highest quality)
95
const int VideoQuality = 63;
96
const unsigned NumFrames = 30 * Fps; // 30-sec video
97
98
// this is the codec's internal state and context
99
theora_state theoraState;
100
101
// Info Header
102
theora_info encoderInfo;
103
theora_info_init(&encoderInfo);
104
105
encoderInfo.width = ((Width + 15) >>4)<<4; //
encoder frame must
//
be defined
106
encoderInfo.height = ((Height + 15)>>4)<<4; // in
multiples of 16
107
encoderInfo.frame_width = Width;
108
encoderInfo.frame_height = Height;
109
encoderInfo.offset_x = 0;
110
encoderInfo.offset_y = 0;
111
encoderInfo.fps_numerator = Fps;
112
encoderInfo.fps_denominator = 1;
113
encoderInfo.aspect_numerator = Width;
114
encoderInfo.aspect_denominator = Height;
115
encoderInfo.colorspace = OC_CS_UNSPECIFIED;
116
encoderInfo.pixelformat = OC_PF_420;
117
encoderInfo.target_bitrate = 0;
118
encoderInfo.quality = VideoQuality;
119
encoderInfo.dropframes_p = 0;
120
encoderInfo.quick_p = 1;
121
encoderInfo.keyframe_auto_p = 1;
122
encoderInfo.keyframe_frequency = 64;
123
encoderInfo.keyframe_frequency_force = 64;
124
encoderInfo.keyframe_data_target_bitrate = 0;
125
encoderInfo.keyframe_mindistance = 8;
126
encoderInfo.noise_sensitivity = 1;
127
128
// initialize the codecs state
129
assert(0 == theora_encode_init(&theoraState, &encoderInfo));
130
131
// Comment Header
132
theora_comment tc;
133
theora_comment_init(&tc);
134
theora_comment_add(&tc,"AUTHOR=me");
135
theora_comment_add(&tc,"BRIEF=demo generated file");
136
137
FILE *fp = fopen("demo.ogg", "wb");
138
{
139
ogg_stream_state ogg_os;
140
ogg_packet op;
141
ogg_page og;
142
143
assert(0 == ogg_stream_init(&ogg_os, rand()));
144
145
// encode info header in data packet
146
theora_encode_header(&theoraState, &op);
147
ogg_stream_packetin(&ogg_os, &op);
148
if(ogg_stream_pageout(&ogg_os, &og)) {
149
fwrite(og.header, og.header_len, 1, fp);
150
fwrite(og.body, og.body_len, 1, fp);
151
}
152
153
// encode comment header in data packet
154
theora_encode_comment(&tc, &op);
155
ogg_stream_packetin(&ogg_os, &op);
156
if(ogg_stream_pageout(&ogg_os, &og)) {
157
fwrite(og.header, og.header_len, 1, fp);
158
fwrite(og.body, og.body_len, 1, fp);
159
}
160
161
// now encode the tables
162
theora_encode_tables(&theoraState, &op);
163
ogg_stream_packetin(&ogg_os, &op);
164
if(ogg_stream_pageout(&ogg_os, &og)) {
165
fwrite(og.header, og.header_len, 1, fp);
166
fwrite(og.body, og.body_len, 1, fp);
167
}
168
169
// flush all headers to packet
170
if(ogg_stream_flush(&ogg_os, &og)) {
171
fwrite(og.header, og.header_len, 1, fp);
172
fwrite(og.body, og.body_len, 1, fp);
173
}
174
175
for (int i=0; i<NumFrames; ++i) {
176
unsigned char yuv[Width*Height*3];
177
memset(yuv,0,Width*Height*3);
178
const double R = 255.0;
179
const double G = 255.0;
180
const double B = 255.0;
181
const double dY = (0.257 * R) + (0.504 * G) + (0.098 * B) +
16;
182
const double dU = -(0.148 * R) - (0.291 * G) + (0.439 * B) +
128;
183
const double dV = (0.439 * R) - (0.368 * G) - (0.071 * B) +
128;
184
const unsigned char Y = (unsigned char)dY;
185
const unsigned char U = (unsigned char)dU;
186
const unsigned char V = (unsigned char)dV;
187
188
// generate the frame background with specified color
189
{
190
unsigned i;
191
for(i = 0; i < Width*Height*3; i+=3) {
192
yuv[i] = Y;
193
yuv[i+1] = U;
194
yuv[i+2] = V;
195
}
196
}
197
198
// -- generate spinning dot of a foreground color --
199
{
200
const double R = 0.0;
201
const double G = 0.0;
202
const double B = 0.0;
203
const double dY = (0.257 * R) + (0.504 * G) + (0.098 * B) +
16;
204
const double dU = -(0.148 * R) - (0.291 * G) + (0.439 * B) +
128;
205
const double dV = (0.439 * R) - (0.368 * G) - (0.071 * B) +
128;
206
const unsigned char Y = (unsigned char)dY;
207
const unsigned char U = (unsigned char)dU;
208
const unsigned char V = (unsigned char)dV;
209
const unsigned cX = Width/2;
210
const unsigned cY = Height/2;
211
const unsigned Radius = 50;
212
static double theta = 0.0;
213
const unsigned x = (unsigned)(Radius * sin(theta) + cX);
214
const unsigned y = (unsigned)(Radius * cos(theta) + cY);
215
const unsigned k=3*(x + (Width*y));
216
theta -= 5.0 * 3.14159/180.0;
217
yuv[k] = Y;
218
yuv[k+1] = G;
219
yuv[k+2] = B;
220
}
221
222
writeFrame(fp,&theoraState,&ogg_os,Width, Height,
yuv);
223
}
224
225
// prepare and close up the packet
226
theora_encode_packetout(&theoraState, 1, &op);
227
if(ogg_stream_pageout(&ogg_os, &og)) {
228
fwrite(og.header, og.header_len, 1, fp);
229
fwrite(og.body, og.body_len, 1, fp);
230
}
231
232
theora_info_clear(&encoderInfo);
233
theora_clear(&theoraState);
234
235
fflush(fp);
236
fclose(fp);
237
}
238
return EXIT_SUCCESS;
239
}
Since
we are interested in line-by-line profiling, we will forgo the
default report and generate a line-by-line profile report as follows:
$
gprof -l -b demo
The
-b argument is to surpress some default output and generate a
brief report; the output is as follows:
Flat
profile:
Each
sample counts as 0.01 seconds.
%
cumulative self self total
time
seconds seconds calls Ts/call Ts/call name
17.81
0.26 0.26 writeFrame (demo.c:61
@ 8048d54)
13.36
0.46 0.20 main (demo.c:191 @
80495d5)
13.01
0.65 0.19 main (demo.c:177 @
804943b)
11.64
0.81 0.17 writeFrame (demo.c:48
@ 8048cc4)
9.59
0.95 0.14 writeFrame (demo.c:60
@ 8048d7d)
6.16
1.04 0.09 writeFrame (demo.c:67
@ 8048da8)
5.48
1.12 0.08 main (demo.c:194 @
80495c2)
5.48
1.21 0.08 writeFrame (demo.c:69
@ 8048de1)
4.11
1.26 0.06 main (demo.c:193 @
80495b1)
3.42
1.31 0.05 main (demo.c:192 @
80495a1)
3.42
1.36 0.05 writeFrame (demo.c:55
@ 8048d0b)
3.42
1.42 0.05 writeFrame (demo.c:53
@ 8048d2a)
1.37
1.44 0.02 writeFrame (demo.c:54
@ 8048cec)
0.68
1.45 0.01 writeFrame (demo.c:49
@ 8048cb4)
0.68
1.46 0.01 writeFrame (demo.c:52
@ 8048d36)
0.34
1.46 0.01 main (demo.c:200 @
80495f3)
0.00
1.46 0.00 300 0.00 0.00 writeFrame (demo.c:12
@ 8048c04)
Call
graph
granularity:
each sample hit covers 4 byte(s) for 0.68% of 1.46 seconds
index
% time self children called name
0.00
0.00 300/300 main (demo.c:222 @ 8049892) [121]
[17]
0.0 0.00 0.00 300 writeFrame (demo.c:12 @
8048c04) [17]
-----------------------------------------------
Index
by function name
[3]
main (demo.c:177 @ 804943b) [17] writeFrame (demo.c:12 @ 8048c04)
[15] writeFrame (demo.c:52 @ 8048d36)
[10]
main (demo.c:192 @ 80495a1) [14] writeFrame (demo.c:49 @ 8048cb4) [1]
writeFrame (demo.c:61 @ 8048d54)
[9]
main (demo.c:193 @ 80495b1) [4] writeFrame (demo.c:48 @ 8048cc4) [5]
writeFrame (demo.c:60 @ 8048d7d)
[7]
main (demo.c:194 @ 80495c2) [13] writeFrame (demo.c:54 @ 8048cec) [6]
writeFrame (demo.c:67 @ 8048da8)
[2]
main (demo.c:191 @ 80495d5) [11] writeFrame (demo.c:55 @ 8048d0b) [8]
writeFrame (demo.c:69 @ 8048de1)
[16]
main (demo.c:200 @ 80495f3) [12] writeFrame (demo.c:53 @ 8048d2a)
The
report summarizes a good deal of information about the application
run but doesn't give a total execution time to determine overall
performance gains. We can time the execution of the application for
a few samples to calculate baseline performance statistics.
$
time ./demo
A
series of 3 runs shows an average real application run-time of 1.6553
seconds; we'll define this as our base performance metric.
The
report is ordered with respect to decreasing run time spent on each
line. If we focus our attention on the heavy-hitters in decreasing
order we'll focus our attention in one the following:
Run-Time
Percentage |
Line |
17.81 |
61 |
13.36 |
191 |
13.01 |
177 |
11.64 |
48 |
9.59 |
60 |
6.16 |
67 |
5.48 |
194 |
5.48 |
69 |
4.11 |
193 |
3.42 |
192 |
3.42 |
55 |
3.42 |
53 |
1.37 |
54 |
0.68 |
49 |
0.68 |
52 |
0.34 |
200 |
Working
our way down the candidate list we find lines 61, 191, 177 and so on
we find optimization of these lines to be of a tough breed. Lines
48-49 and 52 however can readily be replaced with calls to memset.
Lines 47-57 become
Timing
our application once again shows an average of 1.481 seconds, showing
a performance improvement of approximately 11%.
The
default application encodes 300 frames, resulting in profiler metrics
being collected for 300 calls. Profiler sample counts of 0.01
seconds implies that cumulative profiler measurements less than the
sample count cannot be trusted. Increasing the number of iterations
allows for cumulative results that exceed this sample count value and
therefore increases our candidate optimization regions. Increasing
the iterations to 30000 and rerunning profiler collection show a
significantly different report on our updated source code.
Flat
profile:
Each
sample counts as 0.01 seconds.
%
cumulative self self total
time
seconds seconds calls ns/call ns/call name
16.77
18.20 18.20 writeFrame (demo.c:53
@ 8048cfa)
16.25
35.84 17.64 main (demo.c:183 @
804957e)
14.96
52.08 16.24 main (demo.c:169 @
80493e4)
11.15
64.18 12.11 writeFrame (demo.c:52
@ 8048d23)
7.59
72.42 8.24 writeFrame (demo.c:61
@ 8048d87)
6.84
79.84 7.42 writeFrame (demo.c:59
@ 8048d4e)
6.22
86.59 6.75 main (demo.c:186 @
804956b)
4.99
92.02 5.42 writeFrame (demo.c:47
@ 8048ca3)
4.95
97.39 5.38 main (demo.c:185 @
804955a)
2.47
100.07 2.68 main (demo.c:184 @
804954a)
2.31
102.58 2.50 writeFrame (demo.c:58
@ 8048dc2)
1.32
104.00 1.43 writeFrame (demo.c:48
@ 8048cb8)
1.13
105.23 1.23 writeFrame (demo.c:49
@ 8048cd0)
0.82
106.12 0.89 writeFrame (demo.c:51
@ 8048d2e)
0.79
106.98 0.86 main (demo.c:183 @
8049541)
0.74
107.79 0.81 writeFrame (demo.c:52
@ 8048cf1)
0.29
108.11 0.32 writeFrame (demo.c:57
@ 8048dce)
0.19
108.31 0.20 writeFrame (demo.c:58
@ 8048d45)
0.04
108.35 0.04 main (demo.c:167 @
8049881)
0.03
108.38 0.03 writeFrame (demo.c:67
@ 8048e18)
0.02
108.41 0.03 main (demo.c:205 @
80496b5)
0.02
108.42 0.02 writeFrame (demo.c:66
@ 8048dde)
0.01
108.44 0.01 main (demo.c:206 @
804972e)
0.01
108.45 0.01 30000 333.33 333.33 writeFrame (demo.c:12
@ 8048c04)
0.01
108.46 0.01 main (demo.c:167 @
804937f)
0.01
108.47 0.01 main (demo.c:170 @
8049409)
0.01
108.48 0.01 main (demo.c:174 @
8049460)
0.01
108.49 0.01 main (demo.c:178 @
8049522)
0.01
108.50 0.01 main (demo.c:192 @
804959c)
0.01
108.51 0.01 main (demo.c:196 @
80495db)
0.01
108.52 0.01 writeFrame (demo.c:70
@ 8048e6c)
0.00
108.53 0.01 main (demo.c:176 @
80494cc)
0.00
108.53 0.01 main (demo.c:177 @
8049503)
0.00
108.53 0.01 main (demo.c:207 @
80497a7)
0.00
108.54 0.01 main (demo.c:208 @
80497bf)
Call
graph
granularity:
each sample hit covers 4 byte(s) for 0.01% of 108.54 seconds
index
% time self children called name
0.01
0.00 30000/30000 main (demo.c:214 @ 804983b) [31]
[24]
0.0 0.01 0.00 30000 writeFrame (demo.c:12 @
8048c04) [24]
-----------------------------------------------
Index
by function name
[25]
main (demo.c:167 @ 804937f) [29] main (demo.c:192 @ 804959c) [1]
writeFrame (demo.c:53 @ 8048cfa)
[3]
main (demo.c:169 @ 80493e4) [30] main (demo.c:196 @ 80495db) [4]
writeFrame (demo.c:52 @ 8048d23)
[26]
main (demo.c:170 @ 8049409) [21] main (demo.c:205 @ 80496b5) [14]
writeFrame (demo.c:51 @ 8048d2e)
[27]
main (demo.c:174 @ 8049460) [23] main (demo.c:206 @ 804972e) [18]
writeFrame (demo.c:58 @ 8048d45)
[33]
main (demo.c:176 @ 80494cc) [35] main (demo.c:207 @ 80497a7) [6]
writeFrame (demo.c:59 @ 8048d4e)
[34]
main (demo.c:177 @ 8049503) [36] main (demo.c:208 @ 80497bf) [5]
writeFrame (demo.c:61 @ 8048d87)
[28]
main (demo.c:178 @ 8049522) [19] main (demo.c:167 @ 8049881) [11]
writeFrame (demo.c:58 @ 8048dc2)
[15]
main (demo.c:183 @ 8049541) [24] writeFrame (demo.c:12 @ 8048c04)
[17] writeFrame (demo.c:57 @ 8048dce)
[10]
main (demo.c:184 @ 804954a) [8] writeFrame (demo.c:47 @ 8048ca3) [22]
writeFrame (demo.c:66 @ 8048dde)
[9]
main (demo.c:185 @ 804955a) [12] writeFrame (demo.c:48 @ 8048cb8)
[20] writeFrame (demo.c:67 @ 8048e18)
[7]
main (demo.c:186 @ 804956b) [13] writeFrame (demo.c:49 @ 8048cd0)
[32] writeFrame (demo.c:70 @ 8048e6c)
[2]
main (demo.c:183 @ 804957e) [16] writeFrame (demo.c:52 @ 8048cf1)
We
can now concentrate our efforts on some new regions. Lines 170-178
are responsible for converting RGB values to YUV colorspace. A quick
search for an alternative integer approximation may prove to increase
our performance; namely:
unsigned
char Y = (abs(R * 2104 +
G
* 4130 +
B
* 802 + 4096 + 131072) >> 13);
unsigned
char U = (abs(R * -1214 +
G
* -2384 +
B
* 3598 + 4096 + 1048576) >> 13);
unsigned
char V = (abs(R * 3598 +
G
* -3013 +
B
* -585 + 4096 + 1048576) >> 13);
Since
we use this conversion in two areas in our code, the replacement is
two-fold. Our new time measurements show an average execution time
of 0.9403; a performance improvement of ~43% over our original
application. This proves worthy of our hours worth of effort. We
can repeat this cycle and wring out candidate regions until we are
satisfied with our applications performance.
Conclusion
Profiling our application and focusing
our optimization on program hotspots demonstrated a 43% performance
increase for approximately an hours worth of effort. Application
optimization should begin at the crucial stage of application design.
A poorly designed application will never yield an optimized
application, regardless of how many cycles you perform the
profile/optimize cycle. A well-designed application can be profiled
by using the Gprof utility and identified candidate regions can be
concentrated on for optimization opportunities.
References
-
http://www.gnu.org/software/binutils/manual/gprof-2.9.1/html_chapter/gprof_toc.html
-
http://en.wikipedia.org/wiki/Performance_analysis
No comments:
Post a Comment