Monday, May 27, 2019

Introduction to the GNU Build System


Brief

Chances are that if you've ever obtained a software product that defined a 3-step installation procedure of 'configure', 'make' and 'make install', then you've utilized the GNU Build Environment (also known as autotools). This environment has become the de facto standard in deploying software from source on Posix-like systems. The goal of this paper is to introduce you to the build environment enough to incorporate it into your own software products.

Introduction

The GNU Build System (also known as autotools) is a suite of tools authored by Gnu used to make portable software products. The build system will be viewed from two perspectives; 1) the user perspective as the user of the product, and 2) the developers perspective as the developer of the product. The remainder of this paper will discuss the system from both perspectives.


Discussion

To begin understanding the GNU Build Environment, you must first view it from two perspectives; that of the user, and that of the developer.

User Perspective

After obtaining the software archive, the user simply wants to configure, build, test and install the product. The easier this is to do, the more likely the user will use the product. A competing objective however is to not sacrifice flexibility. The GNU build environment offers both flexibility as well as simplicity from the user perspective. The user proceeds with a 3-step procedure.

Step 1 – Configure

This step is responsible for interrogating the users system and makes note of the systems environment, platform, architecture and the existence of required features that the software product requires. A plethora of configurable parameters are available to the user in specifying how the product will be installed. A notable example of a configurable parameter is the location of which the product will be installed. Each configurable parameter has a specified default and specifying an alternative is only necessary if the default is not acceptable.
Successful execution of the configuration script results in a customized Makefile for the users system. This allows the user to proceed to the next step; building the software from source. If however, the configuration script fails to locate essential features on the users system the user must resolve the issue(s) and repeat this step until successful.

Step 2 – Make

The generated Makefile from step 1 supports this step as well as the final installation step. The makefile is populated with the discovered specifics of the users system discovered in the previous step. The default makefile target builds the software, allowing the user to test and confirm the system before proceeding to the installation step. If the user finds that the product does not meet the users needs he/she can simply stop at this step and discontinue installation deleting the temporary source archive at their convenience. If the product is deemed worth installing the user proceeds to the final step.

Step 3 – Make Install

The third and final step involves installing the product which involves copying the binaries, documentation, libraries and like to their final destinations. While these destinations can be specified at time of configuration, the defaults generally specify a subdirectory of /usr. As a result, this step normally requires superuser (root on *nix variants).

Developers Perspective

The remainder of this section will guide you step-by-step through an example, which incorporates the build environment.

We start by creating a subdirectory, which will be the location of our product. This is necessary because throughout the course of this procedure a good number of files will be generated as well as directories.

localhost:~$ mkdir autotools/
localhost:~$ cd autotools/

Before we go further it is important to understand our primary goal, which is to give the user the previously defined experience. The initial interface for the user is the 'configure' script, all other interfaces (e.g. Makefile) are auto-generated from this script with input files that will present themselves shortly. The creation of the configuration script therefore is our goal of all to come.

To produce a configuration script for a software product you create a configure.ac file that contains invocations of the autoconf macros. The macros interrogate the system testing for features necessary for the software product. For reasons that will be explained later generate the configure.ac file as specified:

Step-1

localhost:~/autotools$ cat configure.ac
AC_PREREQ(2.59)
AC_INIT(FULL-PACKAGE-NAME, VERSION, BUG-REPORT-ADDRESS)
AM_INIT_AUTOMAKE(someApp,2.1)
AC_CONFIG_HEADER([config.h])
AC_PROG_CC
AC_OUTPUT(Makefile)

Also generate the following files as specified:

localhost:~/autotools$ cat myMain.c
#include <stdio.h>
#include "myMath.h"

int main()
{
const double radius=15.0;
const double area =
areaOfCircle(radius);
printf("(%s:%d) the area of a circle” +
with radius %.2lf ft == %.2lf”+
ft^2\n",
__FILE__,__LINE__,radius,area);
}

localhost:~/autotools$ cat myMath.h
#ifndef MYMATH_H
#define MYMATH_H
double areaOfCircle (double radius);
#endif

localhost:~/autotools$ cat myMath.c
#include "myMath.h"
#include <math.h>

double areaOfCircle (double radius)
{
const double retVal =
pow(radius,2.0)*M_PI;
return retVal;
}

localhost:~/autotools$ cat Makefile.am
bin_PROGRAMS=someApp
someApp_SOURCES=myMain.c myMath.c

Autoconf is a tool that generates the configure script from directives in the configure.ac file. Autoconf is a macro expander, an application that repeatedly performs macro expansions on input replacing macro declarations with macro bodies and at end produces a pure sh configuration script. Since sh is available on Posix-like systems, this ensures the highest degree of portability. The next step is done specifically because autoconf is a macro expander.

Step-2

localhost:~/autotools$ aclocal
localhost:~/autotools$ ls
aclocal.m4 configure.ac myMain.c myMath.h autom4te.cache Makefile.am myMath.c

Notice after executing aclocal the existence of aclocal.m4 and autom4te.cache. The application aclocal creates the file aclocal.m4 by combining stock installed macros, user defined macros to define all the macros required by configure.ac. The autom4te.cache directory is simply a cache for the other tools so later invocations of autoheader, automake, or other tools will not require re-processing configure.ac. This can result in speed-ups of 30x or better. This cache directory may be deleted as you wish and will be regenerated upon execution of autoconf, our next step.

Step-3

localhost:~/autotools$ autoconf
localhost:~/autotools$ ls
aclocal.m4 configure Makefile.am myMath.c autom4te.cache configure.ac myMain.c myMath.h

After execution of autoconf, the configure script exists and we are a good way toward home. The GNU coding standards require the existence of a few files to include NEWS, README, AUTHORS, and ChangeLog. The absence of these files will cause a good deal of warnings in the future so we will generate them in this next step.

Step-4

localhost:~/autotools$ touch NEWS README AUTHORS ChangeLog
localhost:~/autotools$ ls
aclocal.m4 autom4te.cache configure Makefile.am myMath.c NEWS
AUTHORS ChangeLog configure.ac myMain.c myMath.h README

Step-5

The next step will create a template file for C-style #define statements for configure to use. The execution of autoheader will result in the generation of config.h.in file which will later be used by configure.

localhost:~/autotools$ autoheader
localhost:~/autotools$ ls
aclocal.m4 ChangeLog configure.ac myMath.c README AUTHORS config.h.in Makefile.am myMath.h autom4te.cache configure myMain.c NEWS

Step-6

One last step Sparky and you've got an environment. Automake is a tool that automatically generates a Makefile.in that is compliant with GNU Coding Standards.

localhost:~/autotools$ automake -a
lsconfigure.ac:3: installing `./missing'
configure.ac:3: installing `./install-sh'

Makefile.am: installing `./INSTALL'
Makefile.am: installing `./COPYING'
Makefile.am: installing `./depcomp'
localhost:~/autotools$ ls
aclocal.m4 config.h.in depcomp Makefile.in myMath.h
AUTHORS configure INSTALL missing NEWS
autom4te.cache configure.ac install-sh myMain.c README
ChangeLog COPYING Makefile.am myMath.c

Now without knowing the details of what is really going on, you've now got a configure script that gives the user the experience we earlier described. We can now reexamine our steps in a bit more detail.

Step-1 Revisited

Configure.ac
localhost:~/autotools$ cat -n configure.ac
1 AC_PREREQ(2.59)
2 AC_INIT(FULL-PACKAGE-NAME, VERSION, BUG-REPORT-ADDRESS)
3 AM_INIT_AUTOMAKE(someApp,2.1)
4 AC_CONFIG_HEADER([config.h])
5 AC_PROG_CC
6 AC_OUTPUT(Makefile)
7


Line 3 is of particular significance, which specifies the name of the product someApp and product version 2.1. The product and version are of particular interest when distributing the product. Line 4 specifies that you want a config.h.in file to be generated by autoheader. If this line does not exist autoheader will fail in creating the file.
I'll leave the details of the remaining lines undefined, you may look them up at your convenience.


Makefile.am
localhost:~/autotools$ cat -n Makefile.am
1 bin_PROGRAMS=someApp
2 someApp_SOURCES=myMain.c myMath.c
3
This file defines a template for automake to generate the Makefile.in file. Line 1 specifies the name of the application, the following line specifies the sources that define the application. These two lines indirectly result in the necessary Makefiles to build, distribute, and install the product. The needs to author your own Makefiles are no longer required since this method is far more portable.

Step-2 Revisited

Not much to revisit in this step. It simply defines all macros required by configure.ac.


Step-3 Revisited

Autoconf parses the configure.ac file and generates the configure script.


Step-4 Revisited

The created files in step 4 are in compliance with the GNU Coding Standards. They should be populated as per conventions located at www.gnu.org.


Step-5 Revisited

The autoheader step acts on the directive on line 4 of the configure.ac which simply put says you to create a template header for configure (config.h.in). This template will later be used to generate a config.h file that can be imported in your applications source. Various macro definitions that identify the existence of features and files are located in this file. These directives can be used to make portable code.


Step-6 Revisited

Automake helps create portable Makefile(s) which are then processed by the make utility. It takes the Makefile.am file as input and generates a Makefile.in as output. This file is later used by configure.



Conclusion

You’ve now configured a product to adhere to the GNU Build Environment.

Place yourself in the user role specifying. configure, make,& make install will result in installing the product.

Place yourself in the developer role, specifying ‘make dist’ will result in a compressed tarball with your product, ready for distribution.


References



Monday, May 20, 2019

Bubble Gum Engineering

A few years back I worked for a defense contractor that suffered from what I call "bubble gum engineering".  'Bubble Bum Engineering' refers to the practice of adopting a new technology followed by discontinuing after a short period of time. Kind of like spitting out the bubble gum after the flavor is gone.

Our program was complex, a mission-critical self-propelled fully automated weapon system, and taming that complexity with a compressed schedule was a challenge, but one that we capable of addressing. What made the program more challenging than it needed to be was the fact that some of our software managers would take business trips across the country for meetings, and this gave them opportunities to sit next to anonymous vendors peddling their 'wares' and ample time to read up on the new 'sexy' methodologies. They'd drink in the articles or vendor solicitations with wide eyes and eager wallets only to bring back these 'snake oils' announced by a routine all-hands meeting.

Off we'd go with a dramatic change of direction, fast-n-furious with new training, new procedures, and new development mechanisms! Then, shortly after we'd settle in on the new methodology the bubble gum had lost it's flavor to our managers, who'd replace it with the even newer methodology/technique acquired on their last business trip.

Time-after-time we'd be directed a change of course, time-after-time we'd accept these new challenges with depleting zeal slowly chipping away at our schedule, our software stability, and our morale.

I've learned through these challenging trials that there is a limit to reducing the complexity of a problem. Fact is, the problem space of a program defines the minimal complexity of the system, however there is no limit (read that as infinite) to the addition of added complexity. You can take a moderately difficult problem and make it infinitely complex. That is the lesson I learned from that experience anyway.

Sunday, May 12, 2019

Profiling with Gprof


I'm dusting off an old training paper I wrote some years back, still applicable and a bit of a change in pace.  Enjoy.

Brief

Profiling is a mechanism for identifying program 'hotspots'; areas where your program spends a good portion of it's time. These regions are prime candidates for optimizations, where concentration on efficiency will give you the most bang for your buck. Gprof is a GNU profiling tool available for a host of platforms and tightly integrated with the ever-so-popular Gcc compiler. This paper will walk through some basic information to utilize Gprof to identify program hotspots, optimize these regions and with-luck note the performance gains.

Introduction

Delivery of high-performance applications requires a critical step of evaluating the performance of the application. While care in design and implementation with concentration on optimization is essential in delivering high-performance applications profiling the source after implementation may identify regions where code optimization may show the greatest benefits. This paper will demonstrate the procedure for profiling an application, identify program 'hotspots' and optimize these hotspots with intent to show performance gains. The example application is an example produced for an independent course with no artificially introduced optimization candidates. With luck, at papers end the application will be optimized and demonstrate a performance gain.

Discussion

Profiling with Gprof consists of 3 phases: 1) compiling your application with profiling enabled, 2) executing the application to gather profiling metrics, and 3) evaluation of the collected metrics.

Compiling

Enabling profiling requires compiling your application with Gcc much as you'd normally do. The difference however is the introduction of two flags that you may not normally provide (-pg and -g). The -pg option enables profiling, the -g option introduces debugging symbols. The -g option is not too uncommon, but has particular significance when profiling as it provides the required debugging symbols for line-by-line profiling. The absence of the -g flag will result in the profiler gathering and reporting measures on a per function basis. Since we are interested in profiling on a line-by-line basis, we will provide both flags; as follows:
$ gcc -std=c99 -g -pg demo.c -ltheora -lm -o demo

Execution

Execution of your application takes the same form as your normally run it. You should specify the same command line arguments, your inputs and outputs are identical as that while running without profiling enabled. The only possibly observed difference is that your program may run slower as it acquires and tallies your profiling measures. It is worth noting that your application must terminate in a normal fashion, to include returning from the main() function, or via calling exit(), abrupt termination in a non-typical fashion will interrupt profiling and result in no profiling information collected.
The normal termination of your program will result in the profiler generating an output file (gmon.out) just prior to exiting. The location of this output file is generally in the current working directory, however if your application makes a habit of changing directories you may file this output file in the present working directory of your application prior to terminating. The absence of this output file hints at the lack of providing the -pg option during compilation, or the possibility of your application terminating in a non-normal manner.

$ ./demo

Evaluation

Now that your application has been compiled with profiling enabled, executed your application in a nominal fashion, and profiling metrics have been collected. The final step in profiling is the evaluation of the collected metrics. This is explicitly where the gprof utility is used. The default report style is done simply by executing gprof while specifying the executable as follows:
$ gprof demo
This command will evaluate the profiler metrics and generate a default profiler report. We have yet to see the example application source, as it becomes relevant it is listed as follows:


1 // example used for Theora quick survey paper
2 #include <stdio.h>
3 #include <stdlib.h>
4 #include <ogg/ogg.h>
5 #include <theora/theora.h>
6 #include <string.h>
7 #include <math.h>
8 #include <assert.h>
9
10 // this method encodes the specified frame and writes to the ogg file
11 void writeFrame( FILE *fp, theora_state *theoraState,
ogg_stream_state *ogg_os,
12 unsigned w, unsigned h, unsigned char *yuv) {
13 // write the frame
14 yuv_buffer yuv_buf;
15 ogg_packet op;
16 ogg_page og;
17
18 unsigned long yuv_w;
19 unsigned long yuv_h;
20
21 unsigned char *yuv_y;
22 unsigned char *yuv_u;
23 unsigned char *yuv_v;
24
25 unsigned int x;
26 unsigned int y;
27
28 /* Must hold: yuv_w >= w */
29 yuv_w = (w + 15) & ~15;
30
31 /* Must hold: yuv_h >= h */
32 yuv_h = (h + 15) & ~15;
33
34 yuv_y = (unsigned char*)malloc(yuv_w * yuv_h);
35 yuv_u = (unsigned char*)malloc(yuv_w * yuv_h / 4);
36 yuv_v = (unsigned char*)malloc(yuv_w * yuv_h / 4);
37 yuv_buf.y_width = yuv_w;
38 yuv_buf.y_height = yuv_h;
39 yuv_buf.y_stride = yuv_w;
40 yuv_buf.uv_width = yuv_w >> 1;
41 yuv_buf.uv_height = yuv_h >> 1;
42 yuv_buf.uv_stride = yuv_w >> 1;
43 yuv_buf.y = yuv_y;
44 yuv_buf.u = yuv_u;
45 yuv_buf.v = yuv_v;
46
47 for(y = 0; y < yuv_h; y++) {
48 for(x = 0; x < yuv_w; x++) {
49 yuv_y[x + y * yuv_w] = 0;
50 }
51 }
52 for(y = 0; y < yuv_h; y += 2) {
53 for(x = 0; x < yuv_w; x += 2) {
54 yuv_u[(x >> 1) + (y >> 1) * (yuv_w >> 1)] = 0;
55 yuv_v[(x >> 1) + (y >> 1) * (yuv_w >> 1)] = 0;
56 }
57 }
58
59 for(y = 0; y < h; y++) {
60 for(x = 0; x < w; x++) {
61 yuv_y[x + y * yuv_w] = yuv[3 * (x + y * w) + 0];
62 }
63 }
64
65 for(y = 0; y < h; y += 2) {
66 for(x = 0; x < w; x += 2) {
67 yuv_u[(x >> 1) + (y >> 1) * (yuv_w >> 1)] =
68 yuv[3 * (x + y * w) + 1];
69 yuv_v[(x >> 1) + (y >> 1) * (yuv_w >> 1)] =
70 yuv[3 * (x + y * w) + 2];
71 }
72 }
73
74 assert(0 == theora_encode_YUVin(theoraState, &yuv_buf));
75 assert(0 != theora_encode_packetout(theoraState, 0, &op));
76
77 ogg_stream_packetin(ogg_os, &op);
78 if(ogg_stream_pageout(ogg_os, &og)) {
79 fwrite(og.header, og.header_len, 1, fp);
80 fwrite(og.body, og.body_len, 1, fp);
81 }
82
83 free(yuv_y);
84 free(yuv_u);
85 free(yuv_v);
86 }
87
88 int main() {
89 // frame dimensions
90 const unsigned Width = 200;
91 const unsigned Height = 200;
92 // frames per second
93 const unsigned Fps = 10;
94 // video quality 0-63 (63-highest quality)
95 const int VideoQuality = 63;
96 const unsigned NumFrames = 30 * Fps; // 30-sec video
97
98 // this is the codec's internal state and context
99 theora_state theoraState;
100
101 // Info Header
102 theora_info encoderInfo;
103 theora_info_init(&encoderInfo);
104
105 encoderInfo.width = ((Width + 15) >>4)<<4; // encoder frame must
// be defined
106 encoderInfo.height = ((Height + 15)>>4)<<4; // in multiples of 16
107 encoderInfo.frame_width = Width;
108 encoderInfo.frame_height = Height;
109 encoderInfo.offset_x = 0;
110 encoderInfo.offset_y = 0;
111 encoderInfo.fps_numerator = Fps;
112 encoderInfo.fps_denominator = 1;
113 encoderInfo.aspect_numerator = Width;
114 encoderInfo.aspect_denominator = Height;
115 encoderInfo.colorspace = OC_CS_UNSPECIFIED;
116 encoderInfo.pixelformat = OC_PF_420;
117 encoderInfo.target_bitrate = 0;
118 encoderInfo.quality = VideoQuality;
119 encoderInfo.dropframes_p = 0;
120 encoderInfo.quick_p = 1;
121 encoderInfo.keyframe_auto_p = 1;
122 encoderInfo.keyframe_frequency = 64;
123 encoderInfo.keyframe_frequency_force = 64;
124 encoderInfo.keyframe_data_target_bitrate = 0;
125 encoderInfo.keyframe_mindistance = 8;
126 encoderInfo.noise_sensitivity = 1;
127
128 // initialize the codecs state
129 assert(0 == theora_encode_init(&theoraState, &encoderInfo));
130
131 // Comment Header
132 theora_comment tc;
133 theora_comment_init(&tc);
134 theora_comment_add(&tc,"AUTHOR=me");
135 theora_comment_add(&tc,"BRIEF=demo generated file");
136
137 FILE *fp = fopen("demo.ogg", "wb");
138 {
139 ogg_stream_state ogg_os;
140 ogg_packet op;
141 ogg_page og;
142
143 assert(0 == ogg_stream_init(&ogg_os, rand()));
144
145 // encode info header in data packet
146 theora_encode_header(&theoraState, &op);
147 ogg_stream_packetin(&ogg_os, &op);
148 if(ogg_stream_pageout(&ogg_os, &og)) {
149 fwrite(og.header, og.header_len, 1, fp);
150 fwrite(og.body, og.body_len, 1, fp);
151 }
152
153 // encode comment header in data packet
154 theora_encode_comment(&tc, &op);
155 ogg_stream_packetin(&ogg_os, &op);
156 if(ogg_stream_pageout(&ogg_os, &og)) {
157 fwrite(og.header, og.header_len, 1, fp);
158 fwrite(og.body, og.body_len, 1, fp);
159 }
160
161 // now encode the tables
162 theora_encode_tables(&theoraState, &op);
163 ogg_stream_packetin(&ogg_os, &op);
164 if(ogg_stream_pageout(&ogg_os, &og)) {
165 fwrite(og.header, og.header_len, 1, fp);
166 fwrite(og.body, og.body_len, 1, fp);
167 }
168
169 // flush all headers to packet
170 if(ogg_stream_flush(&ogg_os, &og)) {
171 fwrite(og.header, og.header_len, 1, fp);
172 fwrite(og.body, og.body_len, 1, fp);
173 }
174
175 for (int i=0; i<NumFrames; ++i) {
176 unsigned char yuv[Width*Height*3];
177 memset(yuv,0,Width*Height*3);
178 const double R = 255.0;
179 const double G = 255.0;
180 const double B = 255.0;
181 const double dY = (0.257 * R) + (0.504 * G) + (0.098 * B) + 16;
182 const double dU = -(0.148 * R) - (0.291 * G) + (0.439 * B) + 128;
183 const double dV = (0.439 * R) - (0.368 * G) - (0.071 * B) + 128;
184 const unsigned char Y = (unsigned char)dY;
185 const unsigned char U = (unsigned char)dU;
186 const unsigned char V = (unsigned char)dV;
187
188 // generate the frame background with specified color
189 {
190 unsigned i;
191 for(i = 0; i < Width*Height*3; i+=3) {
192 yuv[i] = Y;
193 yuv[i+1] = U;
194 yuv[i+2] = V;
195 }
196 }
197
198 // -- generate spinning dot of a foreground color --
199 {
200 const double R = 0.0;
201 const double G = 0.0;
202 const double B = 0.0;
203 const double dY = (0.257 * R) + (0.504 * G) + (0.098 * B) + 16;
204 const double dU = -(0.148 * R) - (0.291 * G) + (0.439 * B) + 128;
205 const double dV = (0.439 * R) - (0.368 * G) - (0.071 * B) + 128;
206 const unsigned char Y = (unsigned char)dY;
207 const unsigned char U = (unsigned char)dU;
208 const unsigned char V = (unsigned char)dV;
209 const unsigned cX = Width/2;
210 const unsigned cY = Height/2;
211 const unsigned Radius = 50;
212 static double theta = 0.0;
213 const unsigned x = (unsigned)(Radius * sin(theta) + cX);
214 const unsigned y = (unsigned)(Radius * cos(theta) + cY);
215 const unsigned k=3*(x + (Width*y));
216 theta -= 5.0 * 3.14159/180.0;
217 yuv[k] = Y;
218 yuv[k+1] = G;
219 yuv[k+2] = B;
220 }
221
222 writeFrame(fp,&theoraState,&ogg_os,Width, Height, yuv);
223 }
224
225 // prepare and close up the packet
226 theora_encode_packetout(&theoraState, 1, &op);
227 if(ogg_stream_pageout(&ogg_os, &og)) {
228 fwrite(og.header, og.header_len, 1, fp);
229 fwrite(og.body, og.body_len, 1, fp);
230 }
231
232 theora_info_clear(&encoderInfo);
233 theora_clear(&theoraState);
234
235 fflush(fp);
236 fclose(fp);
237 }
238 return EXIT_SUCCESS;
239 }

Since we are interested in line-by-line profiling, we will forgo the default report and generate a line-by-line profile report as follows:
$ gprof -l -b demo

The -b argument is to surpress some default output and generate a brief report; the output is as follows:
Flat profile:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
17.81 0.26 0.26 writeFrame (demo.c:61 @ 8048d54)
13.36 0.46 0.20 main (demo.c:191 @ 80495d5)
13.01 0.65 0.19 main (demo.c:177 @ 804943b)
11.64 0.81 0.17 writeFrame (demo.c:48 @ 8048cc4)
9.59 0.95 0.14 writeFrame (demo.c:60 @ 8048d7d)
6.16 1.04 0.09 writeFrame (demo.c:67 @ 8048da8)
5.48 1.12 0.08 main (demo.c:194 @ 80495c2)
5.48 1.21 0.08 writeFrame (demo.c:69 @ 8048de1)
4.11 1.26 0.06 main (demo.c:193 @ 80495b1)
3.42 1.31 0.05 main (demo.c:192 @ 80495a1)
3.42 1.36 0.05 writeFrame (demo.c:55 @ 8048d0b)
3.42 1.42 0.05 writeFrame (demo.c:53 @ 8048d2a)
1.37 1.44 0.02 writeFrame (demo.c:54 @ 8048cec)
0.68 1.45 0.01 writeFrame (demo.c:49 @ 8048cb4)
0.68 1.46 0.01 writeFrame (demo.c:52 @ 8048d36)
0.34 1.46 0.01 main (demo.c:200 @ 80495f3)
0.00 1.46 0.00 300 0.00 0.00 writeFrame (demo.c:12 @ 8048c04)


Call graph


granularity: each sample hit covers 4 byte(s) for 0.68% of 1.46 seconds

index % time self children called name
0.00 0.00 300/300 main (demo.c:222 @ 8049892) [121]
[17] 0.0 0.00 0.00 300 writeFrame (demo.c:12 @ 8048c04) [17]
-----------------------------------------------


Index by function name

[3] main (demo.c:177 @ 804943b) [17] writeFrame (demo.c:12 @ 8048c04) [15] writeFrame (demo.c:52 @ 8048d36)
[10] main (demo.c:192 @ 80495a1) [14] writeFrame (demo.c:49 @ 8048cb4) [1] writeFrame (demo.c:61 @ 8048d54)
[9] main (demo.c:193 @ 80495b1) [4] writeFrame (demo.c:48 @ 8048cc4) [5] writeFrame (demo.c:60 @ 8048d7d)
[7] main (demo.c:194 @ 80495c2) [13] writeFrame (demo.c:54 @ 8048cec) [6] writeFrame (demo.c:67 @ 8048da8)
[2] main (demo.c:191 @ 80495d5) [11] writeFrame (demo.c:55 @ 8048d0b) [8] writeFrame (demo.c:69 @ 8048de1)
[16] main (demo.c:200 @ 80495f3) [12] writeFrame (demo.c:53 @ 8048d2a)

The report summarizes a good deal of information about the application run but doesn't give a total execution time to determine overall performance gains. We can time the execution of the application for a few samples to calculate baseline performance statistics.
$ time ./demo

A series of 3 runs shows an average real application run-time of 1.6553 seconds; we'll define this as our base performance metric.

The report is ordered with respect to decreasing run time spent on each line. If we focus our attention on the heavy-hitters in decreasing order we'll focus our attention in one the following:
Run-Time Percentage
Line
17.81
61
13.36
191
13.01
177
11.64
48
9.59
60
6.16
67
5.48
194
5.48
69
4.11
193
3.42
192
3.42
55
3.42
53
1.37
54
0.68
49
0.68
52
0.34
200

Working our way down the candidate list we find lines 61, 191, 177 and so on we find optimization of these lines to be of a tough breed. Lines 48-49 and 52 however can readily be replaced with calls to memset. Lines 47-57 become

Timing our application once again shows an average of 1.481 seconds, showing a performance improvement of approximately 11%.

The default application encodes 300 frames, resulting in profiler metrics being collected for 300 calls. Profiler sample counts of 0.01 seconds implies that cumulative profiler measurements less than the sample count cannot be trusted. Increasing the number of iterations allows for cumulative results that exceed this sample count value and therefore increases our candidate optimization regions. Increasing the iterations to 30000 and rerunning profiler collection show a significantly different report on our updated source code.
Flat profile:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ns/call ns/call name
16.77 18.20 18.20 writeFrame (demo.c:53 @ 8048cfa)
16.25 35.84 17.64 main (demo.c:183 @ 804957e)
14.96 52.08 16.24 main (demo.c:169 @ 80493e4)
11.15 64.18 12.11 writeFrame (demo.c:52 @ 8048d23)
7.59 72.42 8.24 writeFrame (demo.c:61 @ 8048d87)
6.84 79.84 7.42 writeFrame (demo.c:59 @ 8048d4e)
6.22 86.59 6.75 main (demo.c:186 @ 804956b)
4.99 92.02 5.42 writeFrame (demo.c:47 @ 8048ca3)
4.95 97.39 5.38 main (demo.c:185 @ 804955a)
2.47 100.07 2.68 main (demo.c:184 @ 804954a)
2.31 102.58 2.50 writeFrame (demo.c:58 @ 8048dc2)
1.32 104.00 1.43 writeFrame (demo.c:48 @ 8048cb8)
1.13 105.23 1.23 writeFrame (demo.c:49 @ 8048cd0)
0.82 106.12 0.89 writeFrame (demo.c:51 @ 8048d2e)
0.79 106.98 0.86 main (demo.c:183 @ 8049541)
0.74 107.79 0.81 writeFrame (demo.c:52 @ 8048cf1)
0.29 108.11 0.32 writeFrame (demo.c:57 @ 8048dce)
0.19 108.31 0.20 writeFrame (demo.c:58 @ 8048d45)
0.04 108.35 0.04 main (demo.c:167 @ 8049881)
0.03 108.38 0.03 writeFrame (demo.c:67 @ 8048e18)
0.02 108.41 0.03 main (demo.c:205 @ 80496b5)
0.02 108.42 0.02 writeFrame (demo.c:66 @ 8048dde)
0.01 108.44 0.01 main (demo.c:206 @ 804972e)
0.01 108.45 0.01 30000 333.33 333.33 writeFrame (demo.c:12 @ 8048c04)
0.01 108.46 0.01 main (demo.c:167 @ 804937f)
0.01 108.47 0.01 main (demo.c:170 @ 8049409)
0.01 108.48 0.01 main (demo.c:174 @ 8049460)
0.01 108.49 0.01 main (demo.c:178 @ 8049522)
0.01 108.50 0.01 main (demo.c:192 @ 804959c)
0.01 108.51 0.01 main (demo.c:196 @ 80495db)
0.01 108.52 0.01 writeFrame (demo.c:70 @ 8048e6c)
0.00 108.53 0.01 main (demo.c:176 @ 80494cc)
0.00 108.53 0.01 main (demo.c:177 @ 8049503)
0.00 108.53 0.01 main (demo.c:207 @ 80497a7)
0.00 108.54 0.01 main (demo.c:208 @ 80497bf)


Call graph


granularity: each sample hit covers 4 byte(s) for 0.01% of 108.54 seconds

index % time self children called name
0.01 0.00 30000/30000 main (demo.c:214 @ 804983b) [31]
[24] 0.0 0.01 0.00 30000 writeFrame (demo.c:12 @ 8048c04) [24]
-----------------------------------------------


Index by function name

[25] main (demo.c:167 @ 804937f) [29] main (demo.c:192 @ 804959c) [1] writeFrame (demo.c:53 @ 8048cfa)
[3] main (demo.c:169 @ 80493e4) [30] main (demo.c:196 @ 80495db) [4] writeFrame (demo.c:52 @ 8048d23)
[26] main (demo.c:170 @ 8049409) [21] main (demo.c:205 @ 80496b5) [14] writeFrame (demo.c:51 @ 8048d2e)
[27] main (demo.c:174 @ 8049460) [23] main (demo.c:206 @ 804972e) [18] writeFrame (demo.c:58 @ 8048d45)
[33] main (demo.c:176 @ 80494cc) [35] main (demo.c:207 @ 80497a7) [6] writeFrame (demo.c:59 @ 8048d4e)
[34] main (demo.c:177 @ 8049503) [36] main (demo.c:208 @ 80497bf) [5] writeFrame (demo.c:61 @ 8048d87)
[28] main (demo.c:178 @ 8049522) [19] main (demo.c:167 @ 8049881) [11] writeFrame (demo.c:58 @ 8048dc2)
[15] main (demo.c:183 @ 8049541) [24] writeFrame (demo.c:12 @ 8048c04) [17] writeFrame (demo.c:57 @ 8048dce)
[10] main (demo.c:184 @ 804954a) [8] writeFrame (demo.c:47 @ 8048ca3) [22] writeFrame (demo.c:66 @ 8048dde)
[9] main (demo.c:185 @ 804955a) [12] writeFrame (demo.c:48 @ 8048cb8) [20] writeFrame (demo.c:67 @ 8048e18)
[7] main (demo.c:186 @ 804956b) [13] writeFrame (demo.c:49 @ 8048cd0) [32] writeFrame (demo.c:70 @ 8048e6c)
[2] main (demo.c:183 @ 804957e) [16] writeFrame (demo.c:52 @ 8048cf1)

We can now concentrate our efforts on some new regions. Lines 170-178 are responsible for converting RGB values to YUV colorspace. A quick search for an alternative integer approximation may prove to increase our performance; namely:

unsigned char Y = (abs(R * 2104 +
G * 4130 +
B * 802 + 4096 + 131072) >> 13);
unsigned char U = (abs(R * -1214 +
G * -2384 +
B * 3598 + 4096 + 1048576) >> 13);
unsigned char V = (abs(R * 3598 +
G * -3013 +
B * -585 + 4096 + 1048576) >> 13);

Since we use this conversion in two areas in our code, the replacement is two-fold. Our new time measurements show an average execution time of 0.9403; a performance improvement of ~43% over our original application. This proves worthy of our hours worth of effort. We can repeat this cycle and wring out candidate regions until we are satisfied with our applications performance.

Conclusion

Profiling our application and focusing our optimization on program hotspots demonstrated a 43% performance increase for approximately an hours worth of effort. Application optimization should begin at the crucial stage of application design. A poorly designed application will never yield an optimized application, regardless of how many cycles you perform the profile/optimize cycle. A well-designed application can be profiled by using the Gprof utility and identified candidate regions can be concentrated on for optimization opportunities.

References

  1. http://www.gnu.org/software/binutils/manual/gprof-2.9.1/html_chapter/gprof_toc.html
  2. http://en.wikipedia.org/wiki/Performance_analysis