It’s an interesting thing about tools: when they deliver what you need from them, you’re often uninterested in how the sausage is made, but digging into the details often reinforces your understanding in the end. Kinda like eating your broccoli; it’s oftentimes good for you, and you’ll likely be better off having done it.
There are just shy of a bazillion things you can learn about FFmpeg and the video/audio domain. We’re going to spend a little bit of time trying to understand some of the details readily available to us, and hopefully come away understanding the tooling and the domain a little better than when we started.
Saddle up, grab a beer, and read on, fellow digital cowboy. FFmpeg typically comes paired with a useful utility called ffprobe, a media prober, which we’ll use to examine media files and pull out interesting nuggets of information.
ffprobe, like FFmpeg, is pretty verbose when run, writing a ton of debug information to stderr. This proves useful when it’s needed, but burdensome when not. For our purposes we’ll quiet the utility down by specifying -loglevel quiet.
Let’s start by examining our media file’s container.
$ ffprobe -loglevel quiet -show_format BigBuckBunny.mp4
[FORMAT]
filename=BigBuckBunny.mp4
nb_streams=2
nb_programs=0
format_name=matroska,webm
format_long_name=Matroska / WebM
start_time=-0.007000
duration=596.501000
size=107903686
bit_rate=1447155
probe_score=100
TAG:COMPATIBLE_BRANDS=iso6avc1mp41
TAG:MAJOR_BRAND=dash
TAG:MINOR_VERSION=0
TAG:ENCODER=Lavf56.40.101
[/FORMAT]
As you’re likely aware, a media container is simply a file that contains the video(s), audio(s), and subtitle(s). Media-wide properties, like file size, tags, and length, are often available, as are user-defined tags (like GPS, date, and so on). By default, -show_format will show all properties of the media container. Sometimes you may wish to limit the fields to ones you’re particularly interested in, like duration and size. You’ll notice the user tags are displayed despite not being specified; I’ve not found a way to suppress them directly, so they can simply be ignored.
$ ffprobe -loglevel quiet -show_format BigBuckBunny.mp4 -show_entries format=duration,size
[FORMAT]
duration=596.501000
size=107903686
TAG:COMPATIBLE_BRANDS=iso6avc1mp41
TAG:MAJOR_BRAND=dash
TAG:MINOR_VERSION=0
TAG:ENCODER=Lavf56.40.101
[/FORMAT]
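As a quick sanity check, the container’s reported bit_rate lines up with what you’d compute yourself from the size and duration fields above (bits = bytes × 8). A minimal sketch in Python, using the values from our output:

```python
# Derive the average bitrate from the container's size and duration,
# and compare it against the bit_rate ffprobe reported above.
size_bytes = 107903686       # size= from the [FORMAT] section
duration_s = 596.501000      # duration= from the [FORMAT] section
reported_bit_rate = 1447155  # bit_rate= from the [FORMAT] section

computed_bit_rate = size_bytes * 8 / duration_s
print(round(computed_bit_rate))  # -> 1447155, matching the reported value
```

Nothing magic in bit_rate, then; for a simple container it’s essentially size over duration.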
Cool, but not particularly interesting, and really nothing a file manager couldn’t show you with a simple right-click.
Let’s dig a bit deeper by examining the video/audio frames. By specifying -show_frames we can extract debug information for each frame in the media file. Let’s peek at the first few dozen lines.
$ ffprobe -loglevel quiet -show_frames BigBuckBunny.mp4
[FRAME]
media_type=video
stream_index=0
key_frame=1
pkt_pts=0
pkt_pts_time=0.000000
pkt_dts=0
pkt_dts_time=0.000000
best_effort_timestamp=0
best_effort_timestamp_time=0.000000
pkt_duration=41
pkt_duration_time=0.041000
pkt_pos=1111
pkt_size=208
width=1280
height=720
pix_fmt=yuv420p
sample_aspect_ratio=1:1
pict_type=I
coded_picture_number=0
display_picture_number=0
interlaced_frame=0
top_field_first=0
repeat_pict=0
color_range=unknown
color_space=unknown
color_primaries=unknown
color_transfer=unknown
chroma_location=left
[/FRAME]
[FRAME]
media_type=audio
stream_index=1
key_frame=1
pkt_pts=0
pkt_pts_time=0.000000
pkt_dts=0
pkt_dts_time=0.000000
best_effort_timestamp=-7
best_effort_timestamp_time=-0.007000
pkt_duration=13
pkt_duration_time=0.013000
pkt_pos=1368
pkt_size=3
sample_fmt=fltp
nb_samples=648
channels=2
channel_layout=stereo
[/FRAME]
Notice that this snippet contains two frames, one video and one audio, each with a set of media-specific fields. Collectively, we’re left with the following collection of fields: best_effort_timestamp, best_effort_timestamp_time, channel_layout, channels, chroma_location, coded_picture_number, color_primaries, color_range, color_space, color_transfer, display_picture_number, height, interlaced_frame, key_frame, media_type, nb_samples, pict_type, pix_fmt, pkt_dts, pkt_dts_time, pkt_duration, pkt_duration_time, pkt_pos, pkt_pts, pkt_pts_time, pkt_size, repeat_pict, sample_aspect_ratio, sample_fmt, stream_index, top_field_first, width.
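If you’d rather work with those fields programmatically than eyeball them, the flat [FRAME]…[/FRAME] output is easy to parse. A minimal sketch (parse_frames is my own helper name, and the sample text is a trimmed copy of the output above):

```python
def parse_frames(text):
    """Parse ffprobe's flat [FRAME]...[/FRAME] blocks into a list of dicts."""
    frames, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "[FRAME]":
            current = {}                      # start collecting a new frame
        elif line == "[/FRAME]" and current is not None:
            frames.append(current)            # frame complete
            current = None
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key] = value
    return frames

sample = """\
[FRAME]
media_type=video
key_frame=1
pkt_pts_time=0.000000
[/FRAME]
[FRAME]
media_type=audio
nb_samples=648
[/FRAME]
"""
frames = parse_frames(sample)
print([f["media_type"] for f in frames])  # ['video', 'audio']
```

(In practice ffprobe’s -print_format option can emit JSON directly, which spares you the parsing; the sketch above just shows how little structure there is in the default output.)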
A diligent and motivated reader could spend time investigating each field, but I’m more of a pass/fail kinda guy, so we’ll limit our interest to a few relevant fields and briefly discuss the relevance of others.
Each frame specifies a media type, audio or video. Let’s focus on video frames for now; we can select only video streams to simplify our review.
$ ffprobe -loglevel quiet -select_streams V -show_frames BigBuckBunny.mp4
[FRAME]
media_type=video
stream_index=0
key_frame=1
pkt_pts=0
pkt_pts_time=0.000000
pkt_dts=0
pkt_dts_time=0.000000
best_effort_timestamp=0
best_effort_timestamp_time=0.000000
pkt_duration=41
pkt_duration_time=0.041000
pkt_pos=1111
pkt_size=208
width=1280
height=720
pix_fmt=yuv420p
sample_aspect_ratio=1:1
pict_type=I
coded_picture_number=0
display_picture_number=0
interlaced_frame=0
top_field_first=0
repeat_pict=0
color_range=unknown
color_space=unknown
color_primaries=unknown
color_transfer=unknown
chroma_location=left
[/FRAME]
[FRAME]
media_type=video
stream_index=0
key_frame=0
pkt_pts=42
pkt_pts_time=0.042000
pkt_dts=42
pkt_dts_time=0.042000
best_effort_timestamp=42
best_effort_timestamp_time=0.042000
pkt_duration=41
pkt_duration_time=0.041000
pkt_pos=1325
pkt_size=37
width=1280
height=720
pix_fmt=yuv420p
sample_aspect_ratio=1:1
pict_type=P
...
Packet Fields You’ll notice there are a number of packet-wise fields (eight, specifically). Remember, even though we are inspecting a file, many video/audio protocols support streaming, where these fields are more relevant. Despite being named with a packet prefix, they are often relevant for files as well, so don’t simply disregard them.
PictureType Field The pict_type field can be particularly interesting for those interested in video compression. Video picture types are often referred to as I-frames, P-frames, or B-frames, each having to do with video compression. I-frames are modestly compressed and are considered self-contained, not requiring other frames to decode. P-frames and B-frames (bi-directional), however, employ a higher level of compression by capitalizing on similarity with the previous or following frames. For example, rather than compress an entire video frame, if two video frames are similar, differing slightly in specific regions, we can focus our compression/storage on the regions of change and greatly increase our compression as a result. That’s precisely the relevance of P-frames and B-frames. P-frames use data from the previous frame, storing/compressing what’s different rather than the entire frame. B-frames extend this by utilizing both the previous frame and the following frame. Pretty neat, huh?
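The “store only what changed” idea behind P-frames can be sketched in a few lines. This is a toy delta encoder over lists of pixel values, nothing resembling a real codec (which works on motion-compensated blocks, not raw pixel diffs), but it shows why similar frames compress so well:

```python
def p_frame_delta(prev, curr):
    """Record only the positions where the current frame differs from the previous."""
    return {i: v for i, (p, v) in enumerate(zip(prev, curr)) if p != v}

def apply_delta(prev, delta):
    """Reconstruct the current frame from the previous frame plus the delta."""
    out = list(prev)
    for i, v in delta.items():
        out[i] = v
    return out

i_frame = [10, 10, 10, 10, 10, 10]      # self-contained "I-frame"
next_frame = [10, 10, 99, 10, 10, 10]   # only one "pixel" changed

delta = p_frame_delta(i_frame, next_frame)
print(delta)  # {2: 99} -- far smaller than storing the whole frame
assert apply_delta(i_frame, delta) == next_frame
```

The catch, of course, is the dependency: you can’t reconstruct next_frame without first having i_frame decoded, which is exactly the ordering problem the timestamps below exist to solve.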
Timestamp Fields Two particularly interesting fields are the decoding time stamp (DTS) and the presentation time stamp (PTS). These timestamps are particularly interesting when you wish to modify the playback speed of a video. The presentation time stamp (PTS) indicates at what time the frame should be ’presented’, or displayed: at 3 minutes, 30.01 seconds into the movie, what frame(s) should pop up for the viewer? Adjusting the PTS of a file can therefore shift, speed up/down, or simply alter when each frame is presented. Halving the PTS values will speed up a video; doubling them will slow it down. Relatively simple.
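That PTS arithmetic is simple enough to sketch directly (the pkt_pts_time values here mirror the video frames we probed above, at roughly 24 fps):

```python
# Scale presentation timestamps to change playback speed.
# Halving each PTS presents every frame twice as early -> 2x speed.
pts_times = [0.000, 0.042, 0.084, 0.126]  # seconds, ~24 fps

def scale_pts(times, factor):
    return [round(t * factor, 3) for t in times]

fast = scale_pts(pts_times, 0.5)  # double playback speed
slow = scale_pts(pts_times, 2.0)  # half playback speed
print(fast)  # [0.0, 0.021, 0.042, 0.063]
print(slow)  # [0.0, 0.084, 0.168, 0.252]
```

This is precisely what FFmpeg’s setpts filter does for you when you ask it to retime a stream.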
The decoding time stamp (DTS) is often identical (or similar) to the PTS, but not necessarily. Why would we possibly need yet another timestamp? It all comes back to compression. Let’s say a sequence of video frames comes in the form of I-frames, P-frames, and B-frames: I P B B... The I-frame is self-contained, and the following P-frame (which is dependent on the previous frame) can rely on the previous frame being decompressed beforehand (because the previous frame’s PTS < the current frame’s PTS), but B-frames throw a wrench into the mix. B-frames are reliant on the previous and the next frame, so both of those frames must be decompressed before the B-frame can be. As a general rule, PTS and DTS tend to only differ when a stream has B-frames in it. The first 30 frames, roughly the first second of our video, are a series of I, P, and B frames.
$ ffprobe -loglevel quiet -select_streams V -show_frames -show_entries frame=pict_type BigBuckBunny.mp4
[FRAME]
pict_type=I
[/FRAME]
[FRAME]
pict_type=P
[/FRAME]
[FRAME]
pict_type=P
[/FRAME]
[FRAME]
pict_type=P
[/FRAME]
[FRAME]
pict_type=P
[/FRAME]
[FRAME]
pict_type=B
[/FRAME]
[FRAME]
pict_type=B
[/FRAME]
[FRAME]
pict_type=P
[/FRAME]
[FRAME]
pict_type=B
[/FRAME]
[FRAME]
pict_type=P
[/FRAME]
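To see concretely why DTS diverges from PTS once B-frames show up, consider a tiny sequence in display order I B B P: each B needs the P decoded first, so the decode (DTS) order becomes I P B B. A hypothetical little reordering helper, assuming only that each B-frame references the nearest non-B frame on either side:

```python
# Display (PTS) order: the numeric suffix is the presentation index.
display_order = ["I0", "B1", "B2", "P3"]

def decode_order(frames):
    """Reorder frames so every B-frame's forward reference is decoded before it."""
    out, pending_b = [], []
    for f in frames:
        if f.startswith("B"):
            pending_b.append(f)    # hold B-frames until their forward reference arrives
        else:
            out.append(f)          # the I/P reference frame decodes first...
            out.extend(pending_b)  # ...then the B-frames that were waiting on it
            pending_b = []
    return out + pending_b

print(decode_order(display_order))  # ['I0', 'P3', 'B1', 'B2']
```

Notice P3 is decoded second but presented fourth: its DTS is earlier than its PTS, which is exactly the mismatch you’ll see in ffprobe output for B-frame-heavy streams.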
Lovely, right?
Another pair of timestamps, a bit less relevant, are the ’best effort’ timestamps (best_effort_timestamp, best_effort_timestamp_time). These tend to only be relevant for streams that specify a DTS but no PTS; they are literally an attempt to guess a PTS-like value (enforcing a monotonically increasing timestamp) derived from the available timestamp values.
Video Size Fields So riddle me this: why does each video stream have width/height fields? Wouldn’t those be better suited to the container? One uniformly sized video file, right? Nope. A container often holds a number of audio tracks (alternative languages, director commentary, etc.), and similarly a number of subtitles for a variety of languages. While not overly common, a container can provide multiple video streams as well: alternative angles, 360-degree video, picture-in-picture, and so on. Each video stream therefore carries its own size fields so it can be displayed properly.
So, that's it; that's all I got for now. I feel like I better understand some of the fields, specifically DTS/PTS, and have a clearer picture of the various compression frame types. Hope it was equally useful to you.
Cheers.