Sunday, May 12, 2019

Profiling with Gprof


I'm dusting off an old training paper I wrote some years back, still applicable and a bit of a change in pace.  Enjoy.

Brief

Profiling is a mechanism for identifying program 'hotspots'; areas where your program spends a good portion of it's time. These regions are prime candidates for optimizations, where concentration on efficiency will give you the most bang for your buck. Gprof is a GNU profiling tool available for a host of platforms and tightly integrated with the ever-so-popular Gcc compiler. This paper will walk through some basic information to utilize Gprof to identify program hotspots, optimize these regions and with-luck note the performance gains.

Introduction

Delivery of high-performance applications requires a critical step of evaluating the performance of the application. While care in design and implementation with concentration on optimization is essential in delivering high-performance applications profiling the source after implementation may identify regions where code optimization may show the greatest benefits. This paper will demonstrate the procedure for profiling an application, identify program 'hotspots' and optimize these hotspots with intent to show performance gains. The example application is an example produced for an independent course with no artificially introduced optimization candidates. With luck, at papers end the application will be optimized and demonstrate a performance gain.

Discussion

Profiling with Gprof consists of 3 phases: 1) compiling your application with profiling enabled, 2) executing the application to gather profiling metrics, and 3) evaluation of the collected metrics.

Compiling

Enabling profiling requires compiling your application with Gcc much as you'd normally do. The difference however is the introduction of two flags that you may not normally provide (-pg and -g). The -pg option enables profiling, the -g option introduces debugging symbols. The -g option is not too uncommon, but has particular significance when profiling as it provides the required debugging symbols for line-by-line profiling. The absence of the -g flag will result in the profiler gathering and reporting measures on a per function basis. Since we are interested in profiling on a line-by-line basis, we will provide both flags; as follows:
$ gcc -std=c99 -g -pg demo.c -ltheora -lm -o demo

Execution

Execution of your application takes the same form as your normally run it. You should specify the same command line arguments, your inputs and outputs are identical as that while running without profiling enabled. The only possibly observed difference is that your program may run slower as it acquires and tallies your profiling measures. It is worth noting that your application must terminate in a normal fashion, to include returning from the main() function, or via calling exit(), abrupt termination in a non-typical fashion will interrupt profiling and result in no profiling information collected.
The normal termination of your program will result in the profiler generating an output file (gmon.out) just prior to exiting. The location of this output file is generally in the current working directory, however if your application makes a habit of changing directories you may file this output file in the present working directory of your application prior to terminating. The absence of this output file hints at the lack of providing the -pg option during compilation, or the possibility of your application terminating in a non-normal manner.

$ ./demo

Evaluation

Now that your application has been compiled with profiling enabled, executed your application in a nominal fashion, and profiling metrics have been collected. The final step in profiling is the evaluation of the collected metrics. This is explicitly where the gprof utility is used. The default report style is done simply by executing gprof while specifying the executable as follows:
$ gprof demo
This command will evaluate the profiler metrics and generate a default profiler report. We have yet to see the example application source, as it becomes relevant it is listed as follows:


1 // example used for Theora quick survey paper
2 #include <stdio.h>
3 #include <stdlib.h>
4 #include <ogg/ogg.h>
5 #include <theora/theora.h>
6 #include <string.h>
7 #include <math.h>
8 #include <assert.h>
9
10 // this method encodes the specified frame and writes to the ogg file
11 void writeFrame( FILE *fp, theora_state *theoraState,
ogg_stream_state *ogg_os,
12 unsigned w, unsigned h, unsigned char *yuv) {
13 // write the frame
14 yuv_buffer yuv_buf;
15 ogg_packet op;
16 ogg_page og;
17
18 unsigned long yuv_w;
19 unsigned long yuv_h;
20
21 unsigned char *yuv_y;
22 unsigned char *yuv_u;
23 unsigned char *yuv_v;
24
25 unsigned int x;
26 unsigned int y;
27
28 /* Must hold: yuv_w >= w */
29 yuv_w = (w + 15) & ~15;
30
31 /* Must hold: yuv_h >= h */
32 yuv_h = (h + 15) & ~15;
33
34 yuv_y = (unsigned char*)malloc(yuv_w * yuv_h);
35 yuv_u = (unsigned char*)malloc(yuv_w * yuv_h / 4);
36 yuv_v = (unsigned char*)malloc(yuv_w * yuv_h / 4);
37 yuv_buf.y_width = yuv_w;
38 yuv_buf.y_height = yuv_h;
39 yuv_buf.y_stride = yuv_w;
40 yuv_buf.uv_width = yuv_w >> 1;
41 yuv_buf.uv_height = yuv_h >> 1;
42 yuv_buf.uv_stride = yuv_w >> 1;
43 yuv_buf.y = yuv_y;
44 yuv_buf.u = yuv_u;
45 yuv_buf.v = yuv_v;
46
47 for(y = 0; y < yuv_h; y++) {
48 for(x = 0; x < yuv_w; x++) {
49 yuv_y[x + y * yuv_w] = 0;
50 }
51 }
52 for(y = 0; y < yuv_h; y += 2) {
53 for(x = 0; x < yuv_w; x += 2) {
54 yuv_u[(x >> 1) + (y >> 1) * (yuv_w >> 1)] = 0;
55 yuv_v[(x >> 1) + (y >> 1) * (yuv_w >> 1)] = 0;
56 }
57 }
58
59 for(y = 0; y < h; y++) {
60 for(x = 0; x < w; x++) {
61 yuv_y[x + y * yuv_w] = yuv[3 * (x + y * w) + 0];
62 }
63 }
64
65 for(y = 0; y < h; y += 2) {
66 for(x = 0; x < w; x += 2) {
67 yuv_u[(x >> 1) + (y >> 1) * (yuv_w >> 1)] =
68 yuv[3 * (x + y * w) + 1];
69 yuv_v[(x >> 1) + (y >> 1) * (yuv_w >> 1)] =
70 yuv[3 * (x + y * w) + 2];
71 }
72 }
73
74 assert(0 == theora_encode_YUVin(theoraState, &yuv_buf));
75 assert(0 != theora_encode_packetout(theoraState, 0, &op));
76
77 ogg_stream_packetin(ogg_os, &op);
78 if(ogg_stream_pageout(ogg_os, &og)) {
79 fwrite(og.header, og.header_len, 1, fp);
80 fwrite(og.body, og.body_len, 1, fp);
81 }
82
83 free(yuv_y);
84 free(yuv_u);
85 free(yuv_v);
86 }
87
88 int main() {
89 // frame dimensions
90 const unsigned Width = 200;
91 const unsigned Height = 200;
92 // frames per second
93 const unsigned Fps = 10;
94 // video quality 0-63 (63-highest quality)
95 const int VideoQuality = 63;
96 const unsigned NumFrames = 30 * Fps; // 30-sec video
97
98 // this is the codec's internal state and context
99 theora_state theoraState;
100
101 // Info Header
102 theora_info encoderInfo;
103 theora_info_init(&encoderInfo);
104
105 encoderInfo.width = ((Width + 15) >>4)<<4; // encoder frame must
// be defined
106 encoderInfo.height = ((Height + 15)>>4)<<4; // in multiples of 16
107 encoderInfo.frame_width = Width;
108 encoderInfo.frame_height = Height;
109 encoderInfo.offset_x = 0;
110 encoderInfo.offset_y = 0;
111 encoderInfo.fps_numerator = Fps;
112 encoderInfo.fps_denominator = 1;
113 encoderInfo.aspect_numerator = Width;
114 encoderInfo.aspect_denominator = Height;
115 encoderInfo.colorspace = OC_CS_UNSPECIFIED;
116 encoderInfo.pixelformat = OC_PF_420;
117 encoderInfo.target_bitrate = 0;
118 encoderInfo.quality = VideoQuality;
119 encoderInfo.dropframes_p = 0;
120 encoderInfo.quick_p = 1;
121 encoderInfo.keyframe_auto_p = 1;
122 encoderInfo.keyframe_frequency = 64;
123 encoderInfo.keyframe_frequency_force = 64;
124 encoderInfo.keyframe_data_target_bitrate = 0;
125 encoderInfo.keyframe_mindistance = 8;
126 encoderInfo.noise_sensitivity = 1;
127
128 // initialize the codecs state
129 assert(0 == theora_encode_init(&theoraState, &encoderInfo));
130
131 // Comment Header
132 theora_comment tc;
133 theora_comment_init(&tc);
134 theora_comment_add(&tc,"AUTHOR=me");
135 theora_comment_add(&tc,"BRIEF=demo generated file");
136
137 FILE *fp = fopen("demo.ogg", "wb");
138 {
139 ogg_stream_state ogg_os;
140 ogg_packet op;
141 ogg_page og;
142
143 assert(0 == ogg_stream_init(&ogg_os, rand()));
144
145 // encode info header in data packet
146 theora_encode_header(&theoraState, &op);
147 ogg_stream_packetin(&ogg_os, &op);
148 if(ogg_stream_pageout(&ogg_os, &og)) {
149 fwrite(og.header, og.header_len, 1, fp);
150 fwrite(og.body, og.body_len, 1, fp);
151 }
152
153 // encode comment header in data packet
154 theora_encode_comment(&tc, &op);
155 ogg_stream_packetin(&ogg_os, &op);
156 if(ogg_stream_pageout(&ogg_os, &og)) {
157 fwrite(og.header, og.header_len, 1, fp);
158 fwrite(og.body, og.body_len, 1, fp);
159 }
160
161 // now encode the tables
162 theora_encode_tables(&theoraState, &op);
163 ogg_stream_packetin(&ogg_os, &op);
164 if(ogg_stream_pageout(&ogg_os, &og)) {
165 fwrite(og.header, og.header_len, 1, fp);
166 fwrite(og.body, og.body_len, 1, fp);
167 }
168
169 // flush all headers to packet
170 if(ogg_stream_flush(&ogg_os, &og)) {
171 fwrite(og.header, og.header_len, 1, fp);
172 fwrite(og.body, og.body_len, 1, fp);
173 }
174
175 for (int i=0; i<NumFrames; ++i) {
176 unsigned char yuv[Width*Height*3];
177 memset(yuv,0,Width*Height*3);
178 const double R = 255.0;
179 const double G = 255.0;
180 const double B = 255.0;
181 const double dY = (0.257 * R) + (0.504 * G) + (0.098 * B) + 16;
182 const double dU = -(0.148 * R) - (0.291 * G) + (0.439 * B) + 128;
183 const double dV = (0.439 * R) - (0.368 * G) - (0.071 * B) + 128;
184 const unsigned char Y = (unsigned char)dY;
185 const unsigned char U = (unsigned char)dU;
186 const unsigned char V = (unsigned char)dV;
187
188 // generate the frame background with specified color
189 {
190 unsigned i;
191 for(i = 0; i < Width*Height*3; i+=3) {
192 yuv[i] = Y;
193 yuv[i+1] = U;
194 yuv[i+2] = V;
195 }
196 }
197
198 // -- generate spinning dot of a foreground color --
199 {
200 const double R = 0.0;
201 const double G = 0.0;
202 const double B = 0.0;
203 const double dY = (0.257 * R) + (0.504 * G) + (0.098 * B) + 16;
204 const double dU = -(0.148 * R) - (0.291 * G) + (0.439 * B) + 128;
205 const double dV = (0.439 * R) - (0.368 * G) - (0.071 * B) + 128;
206 const unsigned char Y = (unsigned char)dY;
207 const unsigned char U = (unsigned char)dU;
208 const unsigned char V = (unsigned char)dV;
209 const unsigned cX = Width/2;
210 const unsigned cY = Height/2;
211 const unsigned Radius = 50;
212 static double theta = 0.0;
213 const unsigned x = (unsigned)(Radius * sin(theta) + cX);
214 const unsigned y = (unsigned)(Radius * cos(theta) + cY);
215 const unsigned k=3*(x + (Width*y));
216 theta -= 5.0 * 3.14159/180.0;
217 yuv[k] = Y;
218 yuv[k+1] = G;
219 yuv[k+2] = B;
220 }
221
222 writeFrame(fp,&theoraState,&ogg_os,Width, Height, yuv);
223 }
224
225 // prepare and close up the packet
226 theora_encode_packetout(&theoraState, 1, &op);
227 if(ogg_stream_pageout(&ogg_os, &og)) {
228 fwrite(og.header, og.header_len, 1, fp);
229 fwrite(og.body, og.body_len, 1, fp);
230 }
231
232 theora_info_clear(&encoderInfo);
233 theora_clear(&theoraState);
234
235 fflush(fp);
236 fclose(fp);
237 }
238 return EXIT_SUCCESS;
239 }

Since we are interested in line-by-line profiling, we will forgo the default report and generate a line-by-line profile report as follows:
$ gprof -l -b demo

The -b argument is to surpress some default output and generate a brief report; the output is as follows:
Flat profile:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
17.81 0.26 0.26 writeFrame (demo.c:61 @ 8048d54)
13.36 0.46 0.20 main (demo.c:191 @ 80495d5)
13.01 0.65 0.19 main (demo.c:177 @ 804943b)
11.64 0.81 0.17 writeFrame (demo.c:48 @ 8048cc4)
9.59 0.95 0.14 writeFrame (demo.c:60 @ 8048d7d)
6.16 1.04 0.09 writeFrame (demo.c:67 @ 8048da8)
5.48 1.12 0.08 main (demo.c:194 @ 80495c2)
5.48 1.21 0.08 writeFrame (demo.c:69 @ 8048de1)
4.11 1.26 0.06 main (demo.c:193 @ 80495b1)
3.42 1.31 0.05 main (demo.c:192 @ 80495a1)
3.42 1.36 0.05 writeFrame (demo.c:55 @ 8048d0b)
3.42 1.42 0.05 writeFrame (demo.c:53 @ 8048d2a)
1.37 1.44 0.02 writeFrame (demo.c:54 @ 8048cec)
0.68 1.45 0.01 writeFrame (demo.c:49 @ 8048cb4)
0.68 1.46 0.01 writeFrame (demo.c:52 @ 8048d36)
0.34 1.46 0.01 main (demo.c:200 @ 80495f3)
0.00 1.46 0.00 300 0.00 0.00 writeFrame (demo.c:12 @ 8048c04)


Call graph


granularity: each sample hit covers 4 byte(s) for 0.68% of 1.46 seconds

index % time self children called name
0.00 0.00 300/300 main (demo.c:222 @ 8049892) [121]
[17] 0.0 0.00 0.00 300 writeFrame (demo.c:12 @ 8048c04) [17]
-----------------------------------------------


Index by function name

[3] main (demo.c:177 @ 804943b) [17] writeFrame (demo.c:12 @ 8048c04) [15] writeFrame (demo.c:52 @ 8048d36)
[10] main (demo.c:192 @ 80495a1) [14] writeFrame (demo.c:49 @ 8048cb4) [1] writeFrame (demo.c:61 @ 8048d54)
[9] main (demo.c:193 @ 80495b1) [4] writeFrame (demo.c:48 @ 8048cc4) [5] writeFrame (demo.c:60 @ 8048d7d)
[7] main (demo.c:194 @ 80495c2) [13] writeFrame (demo.c:54 @ 8048cec) [6] writeFrame (demo.c:67 @ 8048da8)
[2] main (demo.c:191 @ 80495d5) [11] writeFrame (demo.c:55 @ 8048d0b) [8] writeFrame (demo.c:69 @ 8048de1)
[16] main (demo.c:200 @ 80495f3) [12] writeFrame (demo.c:53 @ 8048d2a)

The report summarizes a good deal of information about the application run but doesn't give a total execution time to determine overall performance gains. We can time the execution of the application for a few samples to calculate baseline performance statistics.
$ time ./demo

A series of 3 runs shows an average real application run-time of 1.6553 seconds; we'll define this as our base performance metric.

The report is ordered with respect to decreasing run time spent on each line. If we focus our attention on the heavy-hitters in decreasing order we'll focus our attention in one the following:
Run-Time Percentage
Line
17.81
61
13.36
191
13.01
177
11.64
48
9.59
60
6.16
67
5.48
194
5.48
69
4.11
193
3.42
192
3.42
55
3.42
53
1.37
54
0.68
49
0.68
52
0.34
200

Working our way down the candidate list we find lines 61, 191, 177 and so on we find optimization of these lines to be of a tough breed. Lines 48-49 and 52 however can readily be replaced with calls to memset. Lines 47-57 become

Timing our application once again shows an average of 1.481 seconds, showing a performance improvement of approximately 11%.

The default application encodes 300 frames, resulting in profiler metrics being collected for 300 calls. Profiler sample counts of 0.01 seconds implies that cumulative profiler measurements less than the sample count cannot be trusted. Increasing the number of iterations allows for cumulative results that exceed this sample count value and therefore increases our candidate optimization regions. Increasing the iterations to 30000 and rerunning profiler collection show a significantly different report on our updated source code.
Flat profile:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ns/call ns/call name
16.77 18.20 18.20 writeFrame (demo.c:53 @ 8048cfa)
16.25 35.84 17.64 main (demo.c:183 @ 804957e)
14.96 52.08 16.24 main (demo.c:169 @ 80493e4)
11.15 64.18 12.11 writeFrame (demo.c:52 @ 8048d23)
7.59 72.42 8.24 writeFrame (demo.c:61 @ 8048d87)
6.84 79.84 7.42 writeFrame (demo.c:59 @ 8048d4e)
6.22 86.59 6.75 main (demo.c:186 @ 804956b)
4.99 92.02 5.42 writeFrame (demo.c:47 @ 8048ca3)
4.95 97.39 5.38 main (demo.c:185 @ 804955a)
2.47 100.07 2.68 main (demo.c:184 @ 804954a)
2.31 102.58 2.50 writeFrame (demo.c:58 @ 8048dc2)
1.32 104.00 1.43 writeFrame (demo.c:48 @ 8048cb8)
1.13 105.23 1.23 writeFrame (demo.c:49 @ 8048cd0)
0.82 106.12 0.89 writeFrame (demo.c:51 @ 8048d2e)
0.79 106.98 0.86 main (demo.c:183 @ 8049541)
0.74 107.79 0.81 writeFrame (demo.c:52 @ 8048cf1)
0.29 108.11 0.32 writeFrame (demo.c:57 @ 8048dce)
0.19 108.31 0.20 writeFrame (demo.c:58 @ 8048d45)
0.04 108.35 0.04 main (demo.c:167 @ 8049881)
0.03 108.38 0.03 writeFrame (demo.c:67 @ 8048e18)
0.02 108.41 0.03 main (demo.c:205 @ 80496b5)
0.02 108.42 0.02 writeFrame (demo.c:66 @ 8048dde)
0.01 108.44 0.01 main (demo.c:206 @ 804972e)
0.01 108.45 0.01 30000 333.33 333.33 writeFrame (demo.c:12 @ 8048c04)
0.01 108.46 0.01 main (demo.c:167 @ 804937f)
0.01 108.47 0.01 main (demo.c:170 @ 8049409)
0.01 108.48 0.01 main (demo.c:174 @ 8049460)
0.01 108.49 0.01 main (demo.c:178 @ 8049522)
0.01 108.50 0.01 main (demo.c:192 @ 804959c)
0.01 108.51 0.01 main (demo.c:196 @ 80495db)
0.01 108.52 0.01 writeFrame (demo.c:70 @ 8048e6c)
0.00 108.53 0.01 main (demo.c:176 @ 80494cc)
0.00 108.53 0.01 main (demo.c:177 @ 8049503)
0.00 108.53 0.01 main (demo.c:207 @ 80497a7)
0.00 108.54 0.01 main (demo.c:208 @ 80497bf)


Call graph


granularity: each sample hit covers 4 byte(s) for 0.01% of 108.54 seconds

index % time self children called name
0.01 0.00 30000/30000 main (demo.c:214 @ 804983b) [31]
[24] 0.0 0.01 0.00 30000 writeFrame (demo.c:12 @ 8048c04) [24]
-----------------------------------------------


Index by function name

[25] main (demo.c:167 @ 804937f) [29] main (demo.c:192 @ 804959c) [1] writeFrame (demo.c:53 @ 8048cfa)
[3] main (demo.c:169 @ 80493e4) [30] main (demo.c:196 @ 80495db) [4] writeFrame (demo.c:52 @ 8048d23)
[26] main (demo.c:170 @ 8049409) [21] main (demo.c:205 @ 80496b5) [14] writeFrame (demo.c:51 @ 8048d2e)
[27] main (demo.c:174 @ 8049460) [23] main (demo.c:206 @ 804972e) [18] writeFrame (demo.c:58 @ 8048d45)
[33] main (demo.c:176 @ 80494cc) [35] main (demo.c:207 @ 80497a7) [6] writeFrame (demo.c:59 @ 8048d4e)
[34] main (demo.c:177 @ 8049503) [36] main (demo.c:208 @ 80497bf) [5] writeFrame (demo.c:61 @ 8048d87)
[28] main (demo.c:178 @ 8049522) [19] main (demo.c:167 @ 8049881) [11] writeFrame (demo.c:58 @ 8048dc2)
[15] main (demo.c:183 @ 8049541) [24] writeFrame (demo.c:12 @ 8048c04) [17] writeFrame (demo.c:57 @ 8048dce)
[10] main (demo.c:184 @ 804954a) [8] writeFrame (demo.c:47 @ 8048ca3) [22] writeFrame (demo.c:66 @ 8048dde)
[9] main (demo.c:185 @ 804955a) [12] writeFrame (demo.c:48 @ 8048cb8) [20] writeFrame (demo.c:67 @ 8048e18)
[7] main (demo.c:186 @ 804956b) [13] writeFrame (demo.c:49 @ 8048cd0) [32] writeFrame (demo.c:70 @ 8048e6c)
[2] main (demo.c:183 @ 804957e) [16] writeFrame (demo.c:52 @ 8048cf1)

We can now concentrate our efforts on some new regions. Lines 170-178 are responsible for converting RGB values to YUV colorspace. A quick search for an alternative integer approximation may prove to increase our performance; namely:

unsigned char Y = (abs(R * 2104 +
G * 4130 +
B * 802 + 4096 + 131072) >> 13);
unsigned char U = (abs(R * -1214 +
G * -2384 +
B * 3598 + 4096 + 1048576) >> 13);
unsigned char V = (abs(R * 3598 +
G * -3013 +
B * -585 + 4096 + 1048576) >> 13);

Since we use this conversion in two areas in our code, the replacement is two-fold. Our new time measurements show an average execution time of 0.9403; a performance improvement of ~43% over our original application. This proves worthy of our hours worth of effort. We can repeat this cycle and wring out candidate regions until we are satisfied with our applications performance.

Conclusion

Profiling our application and focusing our optimization on program hotspots demonstrated a 43% performance increase for approximately an hours worth of effort. Application optimization should begin at the crucial stage of application design. A poorly designed application will never yield an optimized application, regardless of how many cycles you perform the profile/optimize cycle. A well-designed application can be profiled by using the Gprof utility and identified candidate regions can be concentrated on for optimization opportunities.

References

  1. http://www.gnu.org/software/binutils/manual/gprof-2.9.1/html_chapter/gprof_toc.html
  2. http://en.wikipedia.org/wiki/Performance_analysis

No comments:

Post a Comment