Sunday, May 12, 2019

Profiling with Gprof


I'm dusting off an old training paper I wrote some years back, still applicable and a bit of a change in pace.  Enjoy.

Brief

Profiling is a mechanism for identifying program 'hotspots'; areas where your program spends a good portion of it's time. These regions are prime candidates for optimizations, where concentration on efficiency will give you the most bang for your buck. Gprof is a GNU profiling tool available for a host of platforms and tightly integrated with the ever-so-popular Gcc compiler. This paper will walk through some basic information to utilize Gprof to identify program hotspots, optimize these regions and with-luck note the performance gains.

Introduction

Delivery of high-performance applications requires a critical step of evaluating the performance of the application. While care in design and implementation with concentration on optimization is essential in delivering high-performance applications profiling the source after implementation may identify regions where code optimization may show the greatest benefits. This paper will demonstrate the procedure for profiling an application, identify program 'hotspots' and optimize these hotspots with intent to show performance gains. The example application is an example produced for an independent course with no artificially introduced optimization candidates. With luck, at papers end the application will be optimized and demonstrate a performance gain.

Discussion

Profiling with Gprof consists of 3 phases: 1) compiling your application with profiling enabled, 2) executing the application to gather profiling metrics, and 3) evaluation of the collected metrics.

Compiling

Enabling profiling requires compiling your application with Gcc much as you'd normally do. The difference however is the introduction of two flags that you may not normally provide (-pg and -g). The -pg option enables profiling, the -g option introduces debugging symbols. The -g option is not too uncommon, but has particular significance when profiling as it provides the required debugging symbols for line-by-line profiling. The absence of the -g flag will result in the profiler gathering and reporting measures on a per function basis. Since we are interested in profiling on a line-by-line basis, we will provide both flags; as follows:
$ gcc -std=c99 -g -pg demo.c -ltheora -lm -o demo

Execution

Execution of your application takes the same form as your normally run it. You should specify the same command line arguments, your inputs and outputs are identical as that while running without profiling enabled. The only possibly observed difference is that your program may run slower as it acquires and tallies your profiling measures. It is worth noting that your application must terminate in a normal fashion, to include returning from the main() function, or via calling exit(), abrupt termination in a non-typical fashion will interrupt profiling and result in no profiling information collected.
The normal termination of your program will result in the profiler generating an output file (gmon.out) just prior to exiting. The location of this output file is generally in the current working directory, however if your application makes a habit of changing directories you may file this output file in the present working directory of your application prior to terminating. The absence of this output file hints at the lack of providing the -pg option during compilation, or the possibility of your application terminating in a non-normal manner.

$ ./demo

Evaluation

Now that your application has been compiled with profiling enabled, executed your application in a nominal fashion, and profiling metrics have been collected. The final step in profiling is the evaluation of the collected metrics. This is explicitly where the gprof utility is used. The default report style is done simply by executing gprof while specifying the executable as follows:
$ gprof demo
This command will evaluate the profiler metrics and generate a default profiler report. We have yet to see the example application source, as it becomes relevant it is listed as follows:


1 // example used for Theora quick survey paper
2 #include <stdio.h>
3 #include <stdlib.h>
4 #include <ogg/ogg.h>
5 #include <theora/theora.h>
6 #include <string.h>
7 #include <math.h>
8 #include <assert.h>
9
10 // this method encodes the specified frame and writes to the ogg file
11 void writeFrame( FILE *fp, theora_state *theoraState,
ogg_stream_state *ogg_os,
12 unsigned w, unsigned h, unsigned char *yuv) {
13 // write the frame
14 yuv_buffer yuv_buf;
15 ogg_packet op;
16 ogg_page og;
17
18 unsigned long yuv_w;
19 unsigned long yuv_h;
20
21 unsigned char *yuv_y;
22 unsigned char *yuv_u;
23 unsigned char *yuv_v;
24
25 unsigned int x;
26 unsigned int y;
27
28 /* Must hold: yuv_w >= w */
29 yuv_w = (w + 15) & ~15;
30
31 /* Must hold: yuv_h >= h */
32 yuv_h = (h + 15) & ~15;
33
34 yuv_y = (unsigned char*)malloc(yuv_w * yuv_h);
35 yuv_u = (unsigned char*)malloc(yuv_w * yuv_h / 4);
36 yuv_v = (unsigned char*)malloc(yuv_w * yuv_h / 4);
37 yuv_buf.y_width = yuv_w;
38 yuv_buf.y_height = yuv_h;
39 yuv_buf.y_stride = yuv_w;
40 yuv_buf.uv_width = yuv_w >> 1;
41 yuv_buf.uv_height = yuv_h >> 1;
42 yuv_buf.uv_stride = yuv_w >> 1;
43 yuv_buf.y = yuv_y;
44 yuv_buf.u = yuv_u;
45 yuv_buf.v = yuv_v;
46
47 for(y = 0; y < yuv_h; y++) {
48 for(x = 0; x < yuv_w; x++) {
49 yuv_y[x + y * yuv_w] = 0;
50 }
51 }
52 for(y = 0; y < yuv_h; y += 2) {
53 for(x = 0; x < yuv_w; x += 2) {
54 yuv_u[(x >> 1) + (y >> 1) * (yuv_w >> 1)] = 0;
55 yuv_v[(x >> 1) + (y >> 1) * (yuv_w >> 1)] = 0;
56 }
57 }
58
59 for(y = 0; y < h; y++) {
60 for(x = 0; x < w; x++) {
61 yuv_y[x + y * yuv_w] = yuv[3 * (x + y * w) + 0];
62 }
63 }
64
65 for(y = 0; y < h; y += 2) {
66 for(x = 0; x < w; x += 2) {
67 yuv_u[(x >> 1) + (y >> 1) * (yuv_w >> 1)] =
68 yuv[3 * (x + y * w) + 1];
69 yuv_v[(x >> 1) + (y >> 1) * (yuv_w >> 1)] =
70 yuv[3 * (x + y * w) + 2];
71 }
72 }
73
74 assert(0 == theora_encode_YUVin(theoraState, &yuv_buf));
75 assert(0 != theora_encode_packetout(theoraState, 0, &op));
76
77 ogg_stream_packetin(ogg_os, &op);
78 if(ogg_stream_pageout(ogg_os, &og)) {
79 fwrite(og.header, og.header_len, 1, fp);
80 fwrite(og.body, og.body_len, 1, fp);
81 }
82
83 free(yuv_y);
84 free(yuv_u);
85 free(yuv_v);
86 }
87
88 int main() {
89 // frame dimensions
90 const unsigned Width = 200;
91 const unsigned Height = 200;
92 // frames per second
93 const unsigned Fps = 10;
94 // video quality 0-63 (63-highest quality)
95 const int VideoQuality = 63;
96 const unsigned NumFrames = 30 * Fps; // 30-sec video
97
98 // this is the codec's internal state and context
99 theora_state theoraState;
100
101 // Info Header
102 theora_info encoderInfo;
103 theora_info_init(&encoderInfo);
104
105 encoderInfo.width = ((Width + 15) >>4)<<4; // encoder frame must
// be defined
106 encoderInfo.height = ((Height + 15)>>4)<<4; // in multiples of 16
107 encoderInfo.frame_width = Width;
108 encoderInfo.frame_height = Height;
109 encoderInfo.offset_x = 0;
110 encoderInfo.offset_y = 0;
111 encoderInfo.fps_numerator = Fps;
112 encoderInfo.fps_denominator = 1;
113 encoderInfo.aspect_numerator = Width;
114 encoderInfo.aspect_denominator = Height;
115 encoderInfo.colorspace = OC_CS_UNSPECIFIED;
116 encoderInfo.pixelformat = OC_PF_420;
117 encoderInfo.target_bitrate = 0;
118 encoderInfo.quality = VideoQuality;
119 encoderInfo.dropframes_p = 0;
120 encoderInfo.quick_p = 1;
121 encoderInfo.keyframe_auto_p = 1;
122 encoderInfo.keyframe_frequency = 64;
123 encoderInfo.keyframe_frequency_force = 64;
124 encoderInfo.keyframe_data_target_bitrate = 0;
125 encoderInfo.keyframe_mindistance = 8;
126 encoderInfo.noise_sensitivity = 1;
127
128 // initialize the codecs state
129 assert(0 == theora_encode_init(&theoraState, &encoderInfo));
130
131 // Comment Header
132 theora_comment tc;
133 theora_comment_init(&tc);
134 theora_comment_add(&tc,"AUTHOR=me");
135 theora_comment_add(&tc,"BRIEF=demo generated file");
136
137 FILE *fp = fopen("demo.ogg", "wb");
138 {
139 ogg_stream_state ogg_os;
140 ogg_packet op;
141 ogg_page og;
142
143 assert(0 == ogg_stream_init(&ogg_os, rand()));
144
145 // encode info header in data packet
146 theora_encode_header(&theoraState, &op);
147 ogg_stream_packetin(&ogg_os, &op);
148 if(ogg_stream_pageout(&ogg_os, &og)) {
149 fwrite(og.header, og.header_len, 1, fp);
150 fwrite(og.body, og.body_len, 1, fp);
151 }
152
153 // encode comment header in data packet
154 theora_encode_comment(&tc, &op);
155 ogg_stream_packetin(&ogg_os, &op);
156 if(ogg_stream_pageout(&ogg_os, &og)) {
157 fwrite(og.header, og.header_len, 1, fp);
158 fwrite(og.body, og.body_len, 1, fp);
159 }
160
161 // now encode the tables
162 theora_encode_tables(&theoraState, &op);
163 ogg_stream_packetin(&ogg_os, &op);
164 if(ogg_stream_pageout(&ogg_os, &og)) {
165 fwrite(og.header, og.header_len, 1, fp);
166 fwrite(og.body, og.body_len, 1, fp);
167 }
168
169 // flush all headers to packet
170 if(ogg_stream_flush(&ogg_os, &og)) {
171 fwrite(og.header, og.header_len, 1, fp);
172 fwrite(og.body, og.body_len, 1, fp);
173 }
174
175 for (int i=0; i<NumFrames; ++i) {
176 unsigned char yuv[Width*Height*3];
177 memset(yuv,0,Width*Height*3);
178 const double R = 255.0;
179 const double G = 255.0;
180 const double B = 255.0;
181 const double dY = (0.257 * R) + (0.504 * G) + (0.098 * B) + 16;
182 const double dU = -(0.148 * R) - (0.291 * G) + (0.439 * B) + 128;
183 const double dV = (0.439 * R) - (0.368 * G) - (0.071 * B) + 128;
184 const unsigned char Y = (unsigned char)dY;
185 const unsigned char U = (unsigned char)dU;
186 const unsigned char V = (unsigned char)dV;
187
188 // generate the frame background with specified color
189 {
190 unsigned i;
191 for(i = 0; i < Width*Height*3; i+=3) {
192 yuv[i] = Y;
193 yuv[i+1] = U;
194 yuv[i+2] = V;
195 }
196 }
197
198 // -- generate spinning dot of a foreground color --
199 {
200 const double R = 0.0;
201 const double G = 0.0;
202 const double B = 0.0;
203 const double dY = (0.257 * R) + (0.504 * G) + (0.098 * B) + 16;
204 const double dU = -(0.148 * R) - (0.291 * G) + (0.439 * B) + 128;
205 const double dV = (0.439 * R) - (0.368 * G) - (0.071 * B) + 128;
206 const unsigned char Y = (unsigned char)dY;
207 const unsigned char U = (unsigned char)dU;
208 const unsigned char V = (unsigned char)dV;
209 const unsigned cX = Width/2;
210 const unsigned cY = Height/2;
211 const unsigned Radius = 50;
212 static double theta = 0.0;
213 const unsigned x = (unsigned)(Radius * sin(theta) + cX);
214 const unsigned y = (unsigned)(Radius * cos(theta) + cY);
215 const unsigned k=3*(x + (Width*y));
216 theta -= 5.0 * 3.14159/180.0;
217 yuv[k] = Y;
218 yuv[k+1] = G;
219 yuv[k+2] = B;
220 }
221
222 writeFrame(fp,&theoraState,&ogg_os,Width, Height, yuv);
223 }
224
225 // prepare and close up the packet
226 theora_encode_packetout(&theoraState, 1, &op);
227 if(ogg_stream_pageout(&ogg_os, &og)) {
228 fwrite(og.header, og.header_len, 1, fp);
229 fwrite(og.body, og.body_len, 1, fp);
230 }
231
232 theora_info_clear(&encoderInfo);
233 theora_clear(&theoraState);
234
235 fflush(fp);
236 fclose(fp);
237 }
238 return EXIT_SUCCESS;
239 }

Since we are interested in line-by-line profiling, we will forgo the default report and generate a line-by-line profile report as follows:
$ gprof -l -b demo

The -b argument is to surpress some default output and generate a brief report; the output is as follows:
Flat profile:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
17.81 0.26 0.26 writeFrame (demo.c:61 @ 8048d54)
13.36 0.46 0.20 main (demo.c:191 @ 80495d5)
13.01 0.65 0.19 main (demo.c:177 @ 804943b)
11.64 0.81 0.17 writeFrame (demo.c:48 @ 8048cc4)
9.59 0.95 0.14 writeFrame (demo.c:60 @ 8048d7d)
6.16 1.04 0.09 writeFrame (demo.c:67 @ 8048da8)
5.48 1.12 0.08 main (demo.c:194 @ 80495c2)
5.48 1.21 0.08 writeFrame (demo.c:69 @ 8048de1)
4.11 1.26 0.06 main (demo.c:193 @ 80495b1)
3.42 1.31 0.05 main (demo.c:192 @ 80495a1)
3.42 1.36 0.05 writeFrame (demo.c:55 @ 8048d0b)
3.42 1.42 0.05 writeFrame (demo.c:53 @ 8048d2a)
1.37 1.44 0.02 writeFrame (demo.c:54 @ 8048cec)
0.68 1.45 0.01 writeFrame (demo.c:49 @ 8048cb4)
0.68 1.46 0.01 writeFrame (demo.c:52 @ 8048d36)
0.34 1.46 0.01 main (demo.c:200 @ 80495f3)
0.00 1.46 0.00 300 0.00 0.00 writeFrame (demo.c:12 @ 8048c04)


Call graph


granularity: each sample hit covers 4 byte(s) for 0.68% of 1.46 seconds

index % time self children called name
0.00 0.00 300/300 main (demo.c:222 @ 8049892) [121]
[17] 0.0 0.00 0.00 300 writeFrame (demo.c:12 @ 8048c04) [17]
-----------------------------------------------


Index by function name

[3] main (demo.c:177 @ 804943b) [17] writeFrame (demo.c:12 @ 8048c04) [15] writeFrame (demo.c:52 @ 8048d36)
[10] main (demo.c:192 @ 80495a1) [14] writeFrame (demo.c:49 @ 8048cb4) [1] writeFrame (demo.c:61 @ 8048d54)
[9] main (demo.c:193 @ 80495b1) [4] writeFrame (demo.c:48 @ 8048cc4) [5] writeFrame (demo.c:60 @ 8048d7d)
[7] main (demo.c:194 @ 80495c2) [13] writeFrame (demo.c:54 @ 8048cec) [6] writeFrame (demo.c:67 @ 8048da8)
[2] main (demo.c:191 @ 80495d5) [11] writeFrame (demo.c:55 @ 8048d0b) [8] writeFrame (demo.c:69 @ 8048de1)
[16] main (demo.c:200 @ 80495f3) [12] writeFrame (demo.c:53 @ 8048d2a)

The report summarizes a good deal of information about the application run but doesn't give a total execution time to determine overall performance gains. We can time the execution of the application for a few samples to calculate baseline performance statistics.
$ time ./demo

A series of 3 runs shows an average real application run-time of 1.6553 seconds; we'll define this as our base performance metric.

The report is ordered with respect to decreasing run time spent on each line. If we focus our attention on the heavy-hitters in decreasing order we'll focus our attention in one the following:
Run-Time Percentage
Line
17.81
61
13.36
191
13.01
177
11.64
48
9.59
60
6.16
67
5.48
194
5.48
69
4.11
193
3.42
192
3.42
55
3.42
53
1.37
54
0.68
49
0.68
52
0.34
200

Working our way down the candidate list we find lines 61, 191, 177 and so on we find optimization of these lines to be of a tough breed. Lines 48-49 and 52 however can readily be replaced with calls to memset. Lines 47-57 become

Timing our application once again shows an average of 1.481 seconds, showing a performance improvement of approximately 11%.

The default application encodes 300 frames, resulting in profiler metrics being collected for 300 calls. Profiler sample counts of 0.01 seconds implies that cumulative profiler measurements less than the sample count cannot be trusted. Increasing the number of iterations allows for cumulative results that exceed this sample count value and therefore increases our candidate optimization regions. Increasing the iterations to 30000 and rerunning profiler collection show a significantly different report on our updated source code.
Flat profile:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ns/call ns/call name
16.77 18.20 18.20 writeFrame (demo.c:53 @ 8048cfa)
16.25 35.84 17.64 main (demo.c:183 @ 804957e)
14.96 52.08 16.24 main (demo.c:169 @ 80493e4)
11.15 64.18 12.11 writeFrame (demo.c:52 @ 8048d23)
7.59 72.42 8.24 writeFrame (demo.c:61 @ 8048d87)
6.84 79.84 7.42 writeFrame (demo.c:59 @ 8048d4e)
6.22 86.59 6.75 main (demo.c:186 @ 804956b)
4.99 92.02 5.42 writeFrame (demo.c:47 @ 8048ca3)
4.95 97.39 5.38 main (demo.c:185 @ 804955a)
2.47 100.07 2.68 main (demo.c:184 @ 804954a)
2.31 102.58 2.50 writeFrame (demo.c:58 @ 8048dc2)
1.32 104.00 1.43 writeFrame (demo.c:48 @ 8048cb8)
1.13 105.23 1.23 writeFrame (demo.c:49 @ 8048cd0)
0.82 106.12 0.89 writeFrame (demo.c:51 @ 8048d2e)
0.79 106.98 0.86 main (demo.c:183 @ 8049541)
0.74 107.79 0.81 writeFrame (demo.c:52 @ 8048cf1)
0.29 108.11 0.32 writeFrame (demo.c:57 @ 8048dce)
0.19 108.31 0.20 writeFrame (demo.c:58 @ 8048d45)
0.04 108.35 0.04 main (demo.c:167 @ 8049881)
0.03 108.38 0.03 writeFrame (demo.c:67 @ 8048e18)
0.02 108.41 0.03 main (demo.c:205 @ 80496b5)
0.02 108.42 0.02 writeFrame (demo.c:66 @ 8048dde)
0.01 108.44 0.01 main (demo.c:206 @ 804972e)
0.01 108.45 0.01 30000 333.33 333.33 writeFrame (demo.c:12 @ 8048c04)
0.01 108.46 0.01 main (demo.c:167 @ 804937f)
0.01 108.47 0.01 main (demo.c:170 @ 8049409)
0.01 108.48 0.01 main (demo.c:174 @ 8049460)
0.01 108.49 0.01 main (demo.c:178 @ 8049522)
0.01 108.50 0.01 main (demo.c:192 @ 804959c)
0.01 108.51 0.01 main (demo.c:196 @ 80495db)
0.01 108.52 0.01 writeFrame (demo.c:70 @ 8048e6c)
0.00 108.53 0.01 main (demo.c:176 @ 80494cc)
0.00 108.53 0.01 main (demo.c:177 @ 8049503)
0.00 108.53 0.01 main (demo.c:207 @ 80497a7)
0.00 108.54 0.01 main (demo.c:208 @ 80497bf)


Call graph


granularity: each sample hit covers 4 byte(s) for 0.01% of 108.54 seconds

index % time self children called name
0.01 0.00 30000/30000 main (demo.c:214 @ 804983b) [31]
[24] 0.0 0.01 0.00 30000 writeFrame (demo.c:12 @ 8048c04) [24]
-----------------------------------------------


Index by function name

[25] main (demo.c:167 @ 804937f) [29] main (demo.c:192 @ 804959c) [1] writeFrame (demo.c:53 @ 8048cfa)
[3] main (demo.c:169 @ 80493e4) [30] main (demo.c:196 @ 80495db) [4] writeFrame (demo.c:52 @ 8048d23)
[26] main (demo.c:170 @ 8049409) [21] main (demo.c:205 @ 80496b5) [14] writeFrame (demo.c:51 @ 8048d2e)
[27] main (demo.c:174 @ 8049460) [23] main (demo.c:206 @ 804972e) [18] writeFrame (demo.c:58 @ 8048d45)
[33] main (demo.c:176 @ 80494cc) [35] main (demo.c:207 @ 80497a7) [6] writeFrame (demo.c:59 @ 8048d4e)
[34] main (demo.c:177 @ 8049503) [36] main (demo.c:208 @ 80497bf) [5] writeFrame (demo.c:61 @ 8048d87)
[28] main (demo.c:178 @ 8049522) [19] main (demo.c:167 @ 8049881) [11] writeFrame (demo.c:58 @ 8048dc2)
[15] main (demo.c:183 @ 8049541) [24] writeFrame (demo.c:12 @ 8048c04) [17] writeFrame (demo.c:57 @ 8048dce)
[10] main (demo.c:184 @ 804954a) [8] writeFrame (demo.c:47 @ 8048ca3) [22] writeFrame (demo.c:66 @ 8048dde)
[9] main (demo.c:185 @ 804955a) [12] writeFrame (demo.c:48 @ 8048cb8) [20] writeFrame (demo.c:67 @ 8048e18)
[7] main (demo.c:186 @ 804956b) [13] writeFrame (demo.c:49 @ 8048cd0) [32] writeFrame (demo.c:70 @ 8048e6c)
[2] main (demo.c:183 @ 804957e) [16] writeFrame (demo.c:52 @ 8048cf1)

We can now concentrate our efforts on some new regions. Lines 170-178 are responsible for converting RGB values to YUV colorspace. A quick search for an alternative integer approximation may prove to increase our performance; namely:

unsigned char Y = (abs(R * 2104 +
G * 4130 +
B * 802 + 4096 + 131072) >> 13);
unsigned char U = (abs(R * -1214 +
G * -2384 +
B * 3598 + 4096 + 1048576) >> 13);
unsigned char V = (abs(R * 3598 +
G * -3013 +
B * -585 + 4096 + 1048576) >> 13);

Since we use this conversion in two areas in our code, the replacement is two-fold. Our new time measurements show an average execution time of 0.9403; a performance improvement of ~43% over our original application. This proves worthy of our hours worth of effort. We can repeat this cycle and wring out candidate regions until we are satisfied with our applications performance.

Conclusion

Profiling our application and focusing our optimization on program hotspots demonstrated a 43% performance increase for approximately an hours worth of effort. Application optimization should begin at the crucial stage of application design. A poorly designed application will never yield an optimized application, regardless of how many cycles you perform the profile/optimize cycle. A well-designed application can be profiled by using the Gprof utility and identified candidate regions can be concentrated on for optimization opportunities.

References

  1. http://www.gnu.org/software/binutils/manual/gprof-2.9.1/html_chapter/gprof_toc.html
  2. http://en.wikipedia.org/wiki/Performance_analysis

Monday, May 6, 2019

Intellectual Wrestling When Contemplating Leaving a Company



Spending the past decades in a career focused on logic and reason it's disappointing and perhaps eye-opening to observe reasonable people not making reasonable decisions.  As much as I'd like to mount my particularly high horse and proclaim from above how everyone should behave, self-reflection would only exhibit hypocracy as I'm guilty of some of the same short-comings.  I'm still a pup in the study of organizational dynamics so I offer no solutions, only present observations specifically focused on justifying leaving a team which has a corrosive element.  By no means should this be perceived as advocating leaving your team, rather it should be considered reasons you could consider leaving and remedy guilt you may feel in doing so.

Perhaps sadly, long gone are the days of long-term employee/employer relationships.  While more common in past decades, I'm constantly surprised when encountering employee anniversaries on the 15+ year mark.  While admirable, it's far more common for 5-10 year engagements in my observations, but I have no data to back up such a claim.  The point is, company/employee loyalty or long-standing relations are less common than years ago and there is no societal judgement in leaving a job for another position.  In fact, in my opinion, one of the worst things for anyone is to feel or be trapped to an employer/employee.  The employment relationship should be mutually beneficial, if not it's time to consider splitting ties.

I'm getting a bit off-topic so let me reel this in.  Suppose you're currently working on a team that is mostly good, but has some strong elements that makes it frequently suck.  For example, suppose you like your manager, really like most of your team members, but has one (or a few) people make it miserable. Now what?

You're at a cross-roads; stay or leave?

There are strong merits to staying and participating in changing the culture and as admirable this path is, I won't be focusing on it; leaving it to countless other experts.  Stay or go; it's a personal decision, one you need to make for yourself.  Having struggled with this decision numerous times myself and seeing colleagues around me go through it as well it's a pretty common plight.  I thought I'd share some of my thoughts on what may go through your mind and provide some counter-points to common concerns and feelings of guilt.

Feeling of Guilt Leaving the Team

Hopefully, there are team members that you genuinely enjoy working with.  When this is the case I've often struggled with a feeling of abandonment, leaving your good team members to deal with the bad ones.  A feeling that you should remain to assist in changing the team for the better.  Despite the 'fork in the road' analogy the choice to leave can participate in changing the culture.  Pain is a primary motivation, easily observed if you've ever encountered a freak hailstorm on a beautiful sunny day.  Despite your overwhelming desire to stay outdoors, being pelted by hail motivates you to change your current situation.  This goes for you as well as the organization.  Leaving a position, unless it's irrelevant, causes pain to your team, leadership, and organization at many levels.  Project schedules and risk will be impacted which will/should be relayed up the leadership chain.  Time and cost to locate a replacement team member will hit your hiring manager as well as human resources.  Training costs, formal or informal, will impact the organization.  In the case of loss of an employee, rather than a contractor, often a 20% salary finders fee will be accrued, a pretty common fee associated with recruiters.  Loss of a team member is a pretty visible event, while there are plenty of organizations that see this as simply a common thing, you NEVER want to work for a organization that trivializes employees leaving as a non-event.

If you hate the organization you're leaving and everyone associated with it you'll unlikely experience any feelings of remorse, but what if you genuinely like some of the people that will be recipients of the impacts?

What if you like your manager, aren't you hurting them?  In a word, yes; but you may also be helping them.  If a caustic team member is driving away good people it may motivate them to remove the bad element rather than feel everyone will simply deal.  Even the strongest managers struggle with removing an employee for any number of reasons; hoping in time they will improve, a desire to avoid conflict, avoiding the time/cost to locate and train a replacement, a hope in time they'll leave on their own or a desire to avoid the crap-ton of red-tape in documenting and executing an employee dismissal.  Sometimes it's not a shortcoming of your manager, perhaps the caustic employee works for another team.  Like a neighbors dog, often the caustic team member is nice to his/her immediate family but bites everyone else.  The neighbor will strongly desire to keep their family member until it's been demonstrated it bites someone, only then are they sufficiently motivated to address the issue. Leaving an organization is a public event and formally or informally the reason you choose to leave will make it to decision makers capable of addressing the root cause.  If you have a dick working at the company that is driving away talented people it's simply a matter of time before people who can remedy the situation are in the know.  Regardless of how instrumental they may be to the company, a repeated cost of hundreds of thousands of dollars of driving away good people will raise them into a stage they want to avoid.  A company earning a reputation of being a revolving door of hiring is less tangible, but certainly one most companies will avoid.

What About My Team?

Regardless of how good you are, it's incredibly arrogant to ever think your team can't live without you.  Will they be affected?  Absolutely. Short-term they will absolutely encounter pain and likely through no fault of their own.  In my opinion, this will likely be the heaviest burden of guilt you may feel when deciding to leave.  Call it career survivors guilt, understandable but ill-placed.  Long-term, your leaving may contribute more to changing the culture than staying and fighting windmills for years.  It opens a door to find a organization that has a better culture and keeping an eye out for opportunities for your team to join you in the future.

What About My Reputation?

Maybe you're concerned that leaving when 'times got tough' will earn you a quitter scarlet letter.  Having been on both sides of hiring, despite what your high school principle may have informed you....you have no 'permanent record'.  Unless you live in an incredibly small community that is subject to gossip, the reasons you leave an organization are pretty confidential.  The company you left will likely know, and future employment will only know what you share.  Employment references are less common than days of past, and when given are often heavily restricted.  Your past managers sharing the reason you left puts the company in jeopardy of liability.

What If I Can't Find Another Job?

Risk of not being able to find another job may be an issue, but if you're good at what you do and the market is accommodating I've always been confident in finding another position.  You can remedy this concern by finding a position before tenuring your resignation.  Timing of leaving a position for another should be considered, and while likely industry-dependent....avoid leaving near the end of the year when hiring efforts and financials slow.

In Closing

I hope you found some of this useful.  Feel free to share your own questions or opinions on the matter.  Keep in mind, this isn't advocating leaving as the only solution, rather hopefully it provides insight into how leaving a company can long-term benefit a company as well as counter-points to feelings of guilt you may feel when struggling with this decision.  Having worked for a variety of companies these past decades, as an contributor, a technical lead and part of management I feel I have a wide angle lens on employee hiring/leaving practices.

Final note; I've worked for a variety of organizations and have never found one to be inherently evil.  I've never met a Montgomery Burns who treats employees like replaceable cogs and doesn't want to address organizational issues.  Most people want to do the right thing, sometimes it takes a sharp kick in the pants to motivate them to do so.  Reputation and cost are universal motivators to organizations and may provide the tipping point to fix what ails them.