VMAF, short for Video Multimethod Assessment Fusion, was developed by Netflix in 2016 and quickly became the industry-standard tool for video quality assessment. It has an open-source implementation on GitHub and integrates with FFmpeg, which makes it easily accessible to developers and video encoding professionals. Its main drawback is cost: VMAF is resource intensive, especially when it has to be calculated several times per profile within a per-title encoding pipeline. In such cases, quality assessment can cost more than the encoding itself, so it is worth looking for ways to make VMAF cheaper. This article goes through some techniques for doing that.
First, this is one way to run VMAF using FFmpeg:
ffmpeg -i reference_480_1050.mp4 -i distorted_288_375.mp4 \
  -lavfi "[0:v]setpts=PTS-STARTPTS[reference];[1:v]scale=720:480:flags=bicubic,setpts=PTS-STARTPTS[distorted];[distorted][reference]libvmaf=n_threads=12:n_subsample=1" \
  -f null -
You need a reference (mezzanine, source) file and a distorted (encoded) version of that reference. The duration, frame rate, and resolution of the two must match; otherwise you need extra filters to make them match, such as the scale filter in the command above, which upscales the distorted file to the reference resolution.
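When the same command has to be run for many files, it can help to generate it. The following is a minimal sketch in Python that rebuilds the argument list of the command above; the function name `build_vmaf_command` and its parameters are hypothetical, and it assumes simple file names that need no escaping inside the filtergraph.

```python
def build_vmaf_command(reference, distorted, ref_width, ref_height,
                       n_threads=12, n_subsample=1, log_path=None):
    """Return the FFmpeg argument list for one reference/distorted pair."""
    vmaf_opts = [f"n_threads={n_threads}", f"n_subsample={n_subsample}"]
    if log_path is not None:
        # Assumes log_path has no ':' or other characters that would need
        # escaping inside a filtergraph.
        vmaf_opts += ["log_fmt=csv", f"log_path={log_path}"]
    graph = (
        "[0:v]setpts=PTS-STARTPTS[reference];"
        # Scale the distorted stream to the reference resolution.
        f"[1:v]scale={ref_width}:{ref_height}:flags=bicubic,"
        "setpts=PTS-STARTPTS[distorted];"
        "[distorted][reference]libvmaf=" + ":".join(vmaf_opts)
    )
    return ["ffmpeg", "-i", reference, "-i", distorted,
            "-lavfi", graph, "-f", "null", "-"]


cmd = build_vmaf_command("reference_480_1050.mp4", "distorted_288_375.mp4", 720, 480)
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg was built with libvmaf.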
As the reference file for the experiments below, we used the trailer for the acclaimed HBO TV show The Last of Us, a 126 s video encoded with H.264 at 1080p and 9580 kb/s (144 MiB in total). For the distorted version we also used an H.264-encoded file, at 720p and 2800 kb/s. The computer has an AMD Ryzen™ 5 4600H processor with six CPU cores and 12 threads running at 3.0 GHz, and the software is FFmpeg 5.1.2 with libvmaf 2.3.1. The charts below were made using plotly.
Sampling frames
You may have noticed the n_subsample parameter in the command above. It limits the number of frames for which VMAF is calculated, trading accuracy for computational time. VMAF outputs one value for every n_subsample frames, and the most common way of analyzing the overall quality of the title is to pool those values into the mean, the median, or the 1st percentile (which gives you a notion of the worst frames). To make an informed decision, let's first see how much accuracy we lose by increasing this parameter, and then how much speed we gain.
We can see that the error grows slowly for the mean and the median and stays below 0.5 VMAF points even when sampling once every ten frames. That is pretty good considering that the Just Noticeable Difference (JND) for VMAF is about six points, so as long as you remember to add error bands to your analysis to account for it, you should be fine. The 1st percentile is a different story: because the goal of that metric is to identify the worst frames, keeping only one frame in ten implies a higher and less predictable error.
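The pooling and the error comparison described above can be sketched as follows. This is an illustration, not the code behind the charts: the function names are hypothetical, the percentile uses linear interpolation (one of several common definitions), and subsampling is emulated by taking every n-th score of a full run, which assumes a subsampled run would score those same frames.

```python
import math
import statistics


def percentile(values, p):
    """Linearly interpolated p-th percentile, with 0 <= p <= 100."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def pool(scores):
    """Pool per-frame VMAF scores into mean, median and 1st percentile."""
    return {
        "mean": statistics.fmean(scores),
        "median": statistics.median(scores),
        "p1": percentile(scores, 1),
    }


def subsample_error(scores, n):
    """Absolute error of each pooled value when keeping every n-th score."""
    full = pool(scores)
    sampled = pool(scores[::n])
    return {name: abs(sampled[name] - full[name]) for name in full}


# Made-up per-frame scores: mostly good frames plus a few bad ones.
scores = [95.0] * 95 + [40.0, 42.0, 45.0, 50.0, 55.0]
for n in (2, 5, 10):
    print(n, subsample_error(scores, n))
```

With real data, `scores` would be the per-frame VMAF column of a log produced with n_subsample=1.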
How sampling frames affects computational time
Sampling every two frames already cuts the processing time by about 45%. The gain diminishes for higher values, but with n_subsample=5 we are at about one-third of the time it takes to sample every frame. Additionally, the red line shows the peak memory consumption for each case. For n_subsample=1 that is 1322 MB, which is quite a lot for a 144 MiB reference file, so don't forget to add some extra gigs for your 4K catalog!
The time relative to the video duration is important because, for an ABR ladder with multiple renditions, checking the encoding quality of a two-hour movie might take longer than the encoding pipeline itself. Parallelizing the work on a cluster by segmenting the video is also not straightforward, because VMAF considers the previous frames when calculating the quality of the current one. Each segment would therefore need extra leading frames whose scores are discarded, and the frame offsets at the beginning of the segments would have to be accounted for.
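One way to picture that bookkeeping is a segment plan in which each worker decodes a few warm-up frames before the first frame it is responsible for. The sketch below is hypothetical: the function names are invented, and the default of one warm-up frame is an assumption rather than a value taken from libvmaf.

```python
def segment_plan(total_frames, segment_len, warmup=1):
    """Return (decode_start, score_start, end) per segment, end exclusive.

    Each worker decodes from decode_start, but only the scores for frames
    in [score_start, end) are kept; the warm-up scores are discarded.
    """
    plan = []
    for score_start in range(0, total_frames, segment_len):
        decode_start = max(0, score_start - warmup)
        end = min(total_frames, score_start + segment_len)
        plan.append((decode_start, score_start, end))
    return plan


def global_frame_index(decode_start, local_index):
    """Map a frame index inside a segment log back to the full video."""
    return decode_start + local_index


for decode_start, score_start, end in segment_plan(10, 4):
    print(decode_start, score_start, end)
```

When merging the per-segment logs, rows whose global index is below `score_start` would be dropped before pooling.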
Some other points on performance
Threading
By default VMAF will use all available CPU threads, but let's see how much speedup we get from extra cores. The CPU used in the experiments has 6 real cores and 12 threads, and we can see in the chart below that the speedup is minimal once all real cores are in use. The gray dotted line indicates the theoretically perfect scenario, where doubling the number of threads would cut the computational time in half. You may also notice that the speedup is pretty much the same regardless of the distorted file resolution (360p, 720p and 1080p); more on that in the next section.
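The comparison against the ideal line comes down to two ratios: speedup is the single-thread time divided by the time with n threads, and efficiency is that speedup divided by n (1.0 on the ideal line). A small sketch, where the function name is hypothetical and the timings are made-up placeholders rather than the measurements behind the chart:

```python
def speedup_table(timings):
    """Map thread count -> (speedup, efficiency) relative to one thread.

    timings: dict of thread count -> wall-clock seconds; must include key 1.
    """
    base = timings[1]
    table = {}
    for threads, seconds in sorted(timings.items()):
        speedup = base / seconds
        table[threads] = (speedup, speedup / threads)
    return table


# Placeholder timings in seconds, for illustration only.
example = {1: 120.0, 2: 60.0, 6: 24.0, 12: 20.0}
for threads, (speedup, efficiency) in speedup_table(example).items():
    print(f"{threads:2d} threads: speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```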
Impact of video resolution and bitrate
Another thing that might be counterintuitive at first is that calculating VMAF for a 1080p encoded file doesn't take much more time than for a 144p one. See the chart below, which covers a range of resolutions and bitrates for H.264 and HEVC.
To understand that, remember that the encoded files are upscaled to the reference resolution beforehand, so the resolution and bitrate of the reference are what matter the most. To illustrate that, we created HD (720p) and UHD (2160p) versions of our original FHD "The Last of Us" trailer and calculated VMAF against each of them using n_subsample 1, 2 and 5. In all three cases, FHD was approximately 2.1x slower than HD, and UHD 3.8x slower than FHD.
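Those ratios are close to what the pixel counts alone would suggest, which is a quick sanity check worth doing. The arithmetic below assumes 16:9 frames (1280x720, 1920x1080 and 3840x2160), since the text only gives the heights.

```python
# Assumed 16:9 frame sizes; only the heights are stated in the article.
RESOLUTIONS = {"HD": (1280, 720), "FHD": (1920, 1080), "UHD": (3840, 2160)}


def pixel_ratio(larger, smaller):
    """Ratio of pixels per frame between two named resolutions."""
    lw, lh = RESOLUTIONS[larger]
    sw, sh = RESOLUTIONS[smaller]
    return (lw * lh) / (sw * sh)


print(pixel_ratio("FHD", "HD"))   # 2.25, against a measured slowdown of about 2.1x
print(pixel_ratio("UHD", "FHD"))  # 4.0, against a measured slowdown of about 3.8x
```

In other words, VMAF time in these experiments grew roughly in proportion to the number of reference pixels, slightly below a strictly linear relationship.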
VMAF over multiple distorted files at once
We can leverage FFmpeg complex filters to run VMAF over multiple distorted files simultaneously. When encoding for multiple outputs there is a gain as reference is decoded only once. Let’s see if libvmaf can benefit from it in the same way. The command line for running FFmpeg over three encoded files will look something like this:
ffmpeg -i reference.mp4 -i video_h264_360p_400k.mp4 -i video_h264_720p_2800k.mp4 -i video_h264_1080p_5000k.mp4 -lavfi \
'[0:v]split=3[r0][r1][r2]; [1:v]scale=1920:1080:flags=bicubic[d0]; [2:v]scale=1920:1080:flags=bicubic[d1];
[d0][r0]libvmaf=log_fmt=csv:log_path=video_h264_360p_400k.csv:n_threads=1:n_subsample=5;
[d1][r1]libvmaf=log_fmt=csv:log_path=video_h264_720p_2800k.csv:n_threads=1:n_subsample=5;
[3:v][r2]libvmaf=log_fmt=csv:log_path=video_h264_1080p_5000k.csv:n_threads=1:n_subsample=5' -an -sn -f null -
The input reference is split into three streams, one for each distorted file, producing three CSV files with the results. Distorted files are upscaled if necessary, and since we have n_threads=1 for each distorted file, we are in fact using three cores. Thus, we can only compare the multiple-output and single-output solutions when the ladder is executed with a total of 3, 6, 9 or 12 threads. To make it more interesting we also tried increasing n_subsample, and here is the result, with the percentage gain of the multiple-output solution on the y-axis and the total number of threads on the x-axis.
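For ladders with more renditions, writing this filtergraph by hand gets error-prone, because every distorted input index has to line up with its own split output and log file. A minimal sketch of a generator follows; the function name is hypothetical, it assumes input 0 is the reference and inputs 1..N are the distorted files in the same order as the log paths, and for simplicity it scales every distorted stream, including one already at the target resolution.

```python
def build_ladder_graph(log_paths, width, height, n_threads=1, n_subsample=5):
    """Build a filtergraph that scores N distorted inputs against input 0."""
    count = len(log_paths)
    ref_labels = "".join(f"[r{i}]" for i in range(count))
    # One copy of the decoded reference per distorted file.
    chains = [f"[0:v]split={count}{ref_labels}"]
    for i, log_path in enumerate(log_paths):
        # Distorted file i is FFmpeg input i + 1.
        chains.append(f"[{i + 1}:v]scale={width}:{height}:flags=bicubic[d{i}]")
        chains.append(
            f"[d{i}][r{i}]libvmaf=log_fmt=csv:log_path={log_path}"
            f":n_threads={n_threads}:n_subsample={n_subsample}"
        )
    return ";".join(chains)


logs = [
    "video_h264_360p_400k.csv",
    "video_h264_720p_2800k.csv",
    "video_h264_1080p_5000k.csv",
]
print(build_ladder_graph(logs, 1920, 1080))
```

The resulting string is what would go after -lavfi, with the matching -i arguments placed before it in the same order.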
We only see a meaningful gain (above 6%) with higher n_subsample values and all threads in use (4 per output in this case, 12 in total). In this scenario, parallelizing the execution across multiple servers would produce better results than generating multiple outputs on a single machine with the FFmpeg one-liner.
Conclusion
If you only use VMAF sporadically to assess the quality of your encoded videos, the computational time might not be that relevant. Adding VMAF to more sophisticated pipelines, however, can get very expensive, so we hope the colorful charts above help you estimate and optimize that cost.
Here are some conclusions that we have drawn from them:
- Increasing n_subsample will reduce computational time significantly with low impact on accuracy for mean and median, but the 1st percentile will get more unpredictable.
- The distorted bitrate and resolution have no significant impact on computational time.
- The reference resolution and bitrate are what matter, and going from FHD to UHD will make VMAF almost four times more expensive.
- Calculating VMAF for multiple distorted files at once doesn't provide a meaningful speedup unless you use at least n_subsample=10 and have dozens of cores to spare.