Semantic Fidelity under Video Compression



Stage 1: Camera

This video was compressed by the camera to a total of 16,252,932 bytes, or approximately 2,880,874 bps, with no loss of semantic information.
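The byte counts and bit rates quoted throughout are related by the clip duration. A quick sketch of the arithmetic; the duration of roughly 45.13 s is an assumption back-computed from the quoted figures, not stated in the text:

```python
def bitrate_bps(size_bytes: int, duration_s: float) -> float:
    """Average bit rate in bits per second for a file of the given size."""
    return size_bytes * 8 / duration_s

# Clip duration of ~45.13 s is an assumption inferred from the quoted
# byte counts and bit rates.
camera_bps = bitrate_bps(16_252_932, 45.13)  # roughly 2.88 Mbps
stage3_bps = bitrate_bps(19_800, 45.13)      # roughly 3.5 kbps
```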

Stage 2: Grayscale H.264

Extracting the frames from this video, converting them to 8-bit greyscale, and re-encoding as H.264 with the default quality settings for ffmpeg results in a file of 1,306,952 bytes, or approximately 231,661 bps, with no loss of semantic information.
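The text does not give the exact ffmpeg invocation; the sketch below builds one plausible command line for this step, with the specific flag choices being assumptions:

```python
def grayscale_h264_cmd(src: str, dst: str) -> list[str]:
    """Hypothetical ffmpeg command: convert to 8-bit grayscale and
    re-encode as H.264 at ffmpeg's default quality (CRF 23 for libx264)."""
    return [
        "ffmpeg", "-i", src,
        "-vf", "format=gray",   # 8-bit grayscale pixel format
        "-c:v", "libx264",      # H.264, default quality settings
        dst,
    ]

cmd = grayscale_h264_cmd("camera.mp4", "gray.mp4")
```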

Stage 3: 160x120, 5fps, H.264

One can reduce the spatial resolution of this greyscale video to 160x120 and the temporal resolution to 5 fps, then re-encode as H.264, allowing ffmpeg to compress as tightly as it can. This yields a file of 19,800 bytes, at a bit rate of approximately 3,510 bps, that can barely be interpreted by humans, indicating an apparent limit to this approach.
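Again, the exact invocation is not given; one way to request this downscaling and maximal compression from ffmpeg (the flags, including CRF 51 as x264's tightest setting, are assumptions):

```python
def tiny_h264_cmd(src: str, dst: str) -> list[str]:
    """Hypothetical ffmpeg command: 160x120 at 5 fps, grayscale,
    compressed as tightly as libx264 allows."""
    return [
        "ffmpeg", "-i", src,
        "-vf", "scale=160:120,fps=5,format=gray",  # spatial + temporal downsampling
        "-c:v", "libx264",
        "-crf", "51",   # 51 is libx264's lowest-quality, tightest CRF
        dst,
    ]

cmd = tiny_h264_cmd("gray.mp4", "tiny.mp4")
```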

Stage 4: Thresholded Berkeley edge maps

We then apply the Berkeley edge detector (PB; Maire et al. 2008) to the 8-bit greyscale images extracted from the original video captured by the camera, yielding 8-bit graded edge maps, and then threshold these graded edge maps at pixel value 1 to yield binary edge maps. If one encodes these binary edge maps as video in a lossless fashion, it is easy to see that there is no loss of semantic information.
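The thresholding step is simple: any pixel with a graded edge response of 1 or more becomes an edge pixel. A minimal sketch with numpy:

```python
import numpy as np

def threshold_edges(graded: np.ndarray) -> np.ndarray:
    """Binarize an 8-bit graded edge map at pixel value 1:
    any nonzero edge response becomes an edge pixel."""
    return graded >= 1

graded = np.array([[0, 1], [128, 255]], dtype=np.uint8)
binary = threshold_edges(graded)
```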

Stage 5: Traces with intensity proportional to saliency

If we render each trace c at an intensity proportional to its saliency S(c), quantized as 8-bit greyscale images, and encode these images as video in a lossless fashion, it is easy to see that we have largely eliminated background edges while preserving foreground edges, so that there is no loss of semantic information.
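One way to realize the intensity mapping, normalizing so that the most salient trace renders at full brightness (the normalization choice is an assumption; the text only says intensity is proportional to S(c)):

```python
import numpy as np

def render_intensity(saliency: np.ndarray) -> np.ndarray:
    """Map per-trace saliencies S(c) to 8-bit intensities,
    with the most salient trace rendered at 255."""
    s_max = saliency.max()
    if s_max == 0:
        return np.zeros_like(saliency, dtype=np.uint8)
    return np.rint(255 * saliency / s_max).astype(np.uint8)

out = render_intensity(np.array([0.0, 0.5, 1.0]))
```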

Stage 6: Five most-salient contours

If we render the traces corresponding to the solid edges of these contours as binary images (each containing multiple contours) and encode these images as video in a lossless fashion, one observes that we have largely removed the edge fragments that correspond to texture interior to the object boundaries, yet the video still contains sufficient information to support action recognition.
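Selecting the five most-salient contours amounts to ranking contours by total trace saliency and keeping the top five. A sketch, where `saliency_of` is a hypothetical helper that sums S(c) over a contour's traces:

```python
def most_salient(contours, saliency_of, k=5):
    """Return the k contours with the greatest total trace saliency.
    `saliency_of(contour)` is assumed to sum S(c) over the contour's traces."""
    return sorted(contours, key=saliency_of, reverse=True)[:k]

# Toy contours encoded as (name, total saliency) pairs for illustration.
contours = [("a", 3.0), ("b", 1.0), ("c", 5.0)]
top = most_salient(contours, saliency_of=lambda c: c[1], k=2)
```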

Stage 7: Most salient contours

Blindly extracting a fixed number of contours, as we do above, often overcompensates, yielding some contours that do not correspond to the agent. We thus discard those contours in which the total motion saliency of every trace falls below a threshold, computed as a specified fraction (currently 0.25) of the maximum motion saliency for that frame. Rendering this smaller set of contours as a video still preserves the desired semantic information.
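The filtering rule keeps a contour only if at least one of its traces reaches the per-frame relative threshold. A minimal sketch, with each contour encoded as a list of per-trace motion saliencies (the encoding is an assumption for illustration):

```python
def filter_contours(contours, frame_max_saliency, fraction=0.25):
    """Discard contours in which every trace's motion saliency is below
    fraction * (maximum motion saliency for the frame)."""
    threshold = fraction * frame_max_saliency
    return [c for c in contours if any(s >= threshold for s in c)]

# One contour with a strongly moving trace, one with only weak motion.
kept = filter_contours([[0.1, 0.9], [0.05, 0.1]], frame_max_saliency=1.0)
```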

Stage 8: Closed splines

We then fit a closed piecewise cubic B-spline to each contour in each frame and quantize the knot coordinates to integer pixel locations within the image boundaries. Rendering these splines as binary images encoded as video in a lossless fashion illustrates that the signal still contains sufficient information to support action recognition.
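The quantization step rounds each knot coordinate and clamps it to the image. A sketch of just that step (the spline fit itself could be done with, e.g., `scipy.interpolate.splprep` with its periodic option, though the text does not specify the fitting procedure):

```python
import numpy as np

def quantize_knots(knots_xy: np.ndarray, image_w: int, image_h: int) -> np.ndarray:
    """Round B-spline knot coordinates to integer pixel locations and
    clamp them to the image boundaries."""
    q = np.rint(knots_xy).astype(int)
    q[:, 0] = np.clip(q[:, 0], 0, image_w - 1)  # x within [0, W-1]
    q[:, 1] = np.clip(q[:, 1], 0, image_h - 1)  # y within [0, H-1]
    return q

q = quantize_knots(np.array([[-1.4, 3.6], [200.0, 119.2]]), 160, 120)
```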
