By Dr. Cheng Peng, Texas Instruments
The challenge in introducing digital video to
emerging applications, such as smart video
surveillance IP cameras (netcams), is controlling cost while maximizing performance
without compromising reliability or quality.
Traditionally, optimizing a design has entailed extensive hand-coding of algorithms in assembly. Today’s
video development platforms, however, provide
codec and other functionality off-the-shelf. Developers are able to access codec functionality through
APIs which abstract the specific implementation
details of codecs. Such codecs are often already optimized for a particular hardware platform and are
easily integrated into applications using the API.
From an application perspective, it doesn’t
matter what video codec is in use. In fact, if the particular codec and its implementation details can be
transparent to the application, developers gain the
ability to easily interchange codecs during camera
operation (see Figure 1). Interchangeability enables
many important features, such as running a low bit rate codec until an event of interest occurs and then dynamically switching to a higher bit rate, higher quality codec. This maximizes the
use of limited network bandwidth, reserving it for
those cameras where the most quality and detail are
required.
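As a rough illustration of reserving bandwidth for the cameras that need it, the sketch below gives every camera a low bit rate and upgrades only those reporting an event while the shared budget allows. The function name, the camera names, and the 256/2048 kbps figures are assumptions made for this example, not values from any particular platform.

```python
def assign_bit_rates(event_active, budget_kbps, low_kbps=256, high_kbps=2048):
    """Return {camera: bit rate in kbps} for a shared network budget.

    event_active maps a camera name to True when that camera reports an
    event of interest. All names and rates here are illustrative.
    """
    # Every camera is guaranteed the low bit rate first.
    rates = {name: low_kbps for name in event_active}
    remaining = budget_kbps - low_kbps * len(rates)
    if remaining < 0:
        raise ValueError("budget cannot cover the low bit rate for every camera")

    # Spend what is left on cameras that currently report an event.
    upgrade_cost = high_kbps - low_kbps
    for name, active in event_active.items():
        if active and remaining >= upgrade_cost:
            rates[name] = high_kbps
            remaining -= upgrade_cost
    return rates


one_event = assign_bit_rates(
    {"lobby": False, "dock": True, "garage": False}, budget_kbps=4096
)
two_events = assign_bit_rates(
    {"lobby": True, "dock": True, "garage": False}, budget_kbps=3000
)
```

With a 3000 kbps budget and two simultaneous events, only the first camera can be upgraded; a real system would need a policy for choosing which event takes priority.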
CODEC TRANSPARENCY
To keep different codecs transparent at the application level, an API must necessarily be somewhat
generic, focused on encoding, decoding, and control operations. If the API includes too
many codec-specific operations, the application code
will become tied to that particular codec.
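A minimal sketch of such a generic interface follows, written in Python for brevity. The class and method names (VideoEncoder, StubEncoder, Camera, use_encoder) are hypothetical and do not correspond to any vendor API; the stub encoder only tags the frame so that the swap is visible, where a real codec would compress it.

```python
from abc import ABC, abstractmethod


class VideoEncoder(ABC):
    """Generic encoder interface: encode and control, nothing codec-specific."""

    @abstractmethod
    def encode(self, frame: bytes) -> bytes:
        ...

    @abstractmethod
    def control(self, bit_rate_kbps: int) -> None:
        ...


class StubEncoder(VideoEncoder):
    """Stand-in for a real codec; it prefixes a tag instead of compressing."""

    def __init__(self, tag: bytes, bit_rate_kbps: int):
        self.tag = tag
        self.bit_rate_kbps = bit_rate_kbps

    def encode(self, frame: bytes) -> bytes:
        return self.tag + b":" + frame

    def control(self, bit_rate_kbps: int) -> None:
        self.bit_rate_kbps = bit_rate_kbps


class Camera:
    """Application code that depends only on the generic interface."""

    def __init__(self, encoder: VideoEncoder):
        self.encoder = encoder

    def use_encoder(self, encoder: VideoEncoder) -> None:
        # Swapping codecs requires no other change in the application.
        self.encoder = encoder

    def send(self, frame: bytes) -> bytes:
        return self.encoder.encode(frame)


camera = Camera(StubEncoder(b"LOW", 256))
idle_packet = camera.send(b"frame0")
camera.use_encoder(StubEncoder(b"HIGH", 2048))
event_packet = camera.send(b"frame1")
```

Because Camera never reaches past encode and control, any encoder that honors the interface can be substituted during operation.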
A generic API can significantly speed time-to-market and enables digital video to be easily introduced
into a wide range of new applications. Developers,
however, often assume that the only way they can
optimize a system is to break past the API to call
directly into codec code and to fine-tune codec performance. Breaking past the API, however, destroys
the interchangeability of codecs.
Breaking APIs also erodes the reusability of application code since it becomes tied to the use of particular codecs. This is especially important to developers architecting a family of video products ranging in
functionality. Ideally, when a function such as denoising is implemented, OEMs would like to use the
same code across the product line rather than have to
redesign the function for multiple codecs, display
sizes, or bit rates.
System optimizations can result in substantial cost savings when developers can reduce the overall processing load (enabling a less expensive processor to
be used in final production) or decrease the amount
of memory a system requires. The key is to step
beyond hand-coding optimizations or breaking APIs.
By approaching optimization from a system perspective that balances performance, cost, quality, and
reliability, developers achieve a cost-effective design
that provides interchangeability without requiring
tedious code optimizations.
COMPONENT-BASED ARCHITECTURES
Figure 2 shows the basic architecture for an intelligent IP netcam. One of the ways that an effective
API can be defined is to abstract the system into various components which can be implemented in hardware or software, depending upon the actual hardware resources available.
Any of these components could be implemented
completely in hardware or completely in software.
For example, for years ASICs have dominated the
camera market by providing the lowest power consumption with the highest performance. Unfortunately, ASICs have a long development cycle and
result in a fixed implementation. Additionally, the
video coding standards themselves are not stable,
and an MPEG-4 ASIC that can only handle MPEG-4
but not H.264 is already obsolete.
Video development must also take into account
perhaps the most important, and most compute-intensive,
innovation in surveillance: video analytics. Video
analytics lends intelligence to cameras through capabilities such as recognizing objects and triggering
events based on their behavior. This emerging technology is extremely volatile, not only in the capabilities supported but also in the rapidly changing and
innovative ways in which they are implemented.
FPGA and other programmable logic approaches
introduce more flexibility than ASICs while maintaining high performance, but they still entail a very long
development cycle and high cost. Nor is a software-only implementation feasible. Many video functions
are relatively fixed in their base computations and
are well suited to parallel processing, which makes them natural candidates for hardware acceleration. The best approach for addressing the complexity and fluidity of video analytics
and pre-processing functions such as
de-interlacing, de-noising, and color
space conversion is to utilize a mixed
hardware and software approach that
frees the main processor for other tasks. This minimizes development
time, maximizes performance, and
enables developers to abstract these
functions efficiently through an API. It
further results in an upgrade path that
minimizes time-consuming changes at
the application level since developers can interchange codecs without modifying either hardware or the main
application.
COMPONENT-LEVEL OPTIMIZATION
When functions are broken into components that can be implemented
in software or hardware and abstracted by an API, it becomes more difficult
to optimize individual components.
With codecs and most video analytic
functions available off-the-shelf, however, it is no longer necessary or desirable for developers to optimize at the
individual component level. In fact, it
becomes extremely difficult to do so
and in any case, the performance gains
are rarely worth the development
investment. Optimization, then, shifts
to ensuring efficient interaction—both
direct and indirect—between components, such as how an audio codec and
video codec cooperate in their use of
shared system resources including
processor cycles, memory, and DMA
bandwidth.
Memory is a critical system resource
that directly affects system cost. Fortunately, codec implementations tend to
be highly efficient relative to the overall memory available on a processor
and do not require much internal
memory. For example, an MPEG-4
encoder implemented on a TI DM642
processor needs just 256 KB of internal
memory.
Hand-optimized codec implementations may take advantage of fixed resolution and buffer sizes, limiting their
configurability. To achieve true interchangeability, codec implementations
need to be flexible enough to support
various frame sizes and resolutions to
adjust the bit rate to best utilize network bandwidth.
The advantage of a configurable
implementation is that interchangeability is supported directly by the
algorithm and changing bit rate is a
matter of changing the configuration
of the algorithm—i.e., lossiness, resolution, or frame rate—which in turn
defines how large and how many
buffers are required. Additionally,
developers are able to leverage this
configurability to easily balance performance and memory usage. Reducing the size of frame buffers, for example, enables developers to trade off
memory usage against performance. It
also becomes possible to dynamically
protect against network jitter by
increasing buffer size when necessary,
smoothing out network latencies.
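The link between configuration and buffer requirements can be made concrete with some simple arithmetic. The sketch below assumes 8-bit YUV 4:2:0 frames (1.5 bytes per pixel) and a pool of four frame buffers; the pixel format, the buffer count, and the function names are assumptions for illustration only.

```python
def frame_bytes(width, height):
    """Size of one 8-bit YUV 4:2:0 frame (assumed pixel format).

    A full-resolution luma plane plus two quarter-resolution chroma
    planes gives 1.5 bytes per pixel.
    """
    return width * height * 3 // 2


def pool_bytes(width, height, num_buffers):
    """Memory needed for num_buffers frame buffers at one resolution."""
    return frame_bytes(width, height) * num_buffers


def jitter_buffer_bytes(bit_rate_kbps, jitter_ms):
    """Bytes of encoded video needed to ride out jitter_ms of network delay.

    kbps is taken as 1000 bits per second, so
    bytes = kbps * 1000 * (ms / 1000) / 8.
    """
    return bit_rate_kbps * jitter_ms // 8


d1_pool = pool_bytes(720, 480, 4)           # D1 (NTSC) resolution
cif_pool = pool_bytes(352, 288, 4)          # CIF resolution
jitter = jitter_buffer_bytes(2048, 200)     # 200 ms at 2048 kbps
```

Under these assumptions, dropping from D1 to CIF shrinks the frame buffer pool from 2,073,600 bytes to 608,256 bytes, while absorbing 200 ms of jitter at 2048 kbps costs about 51,200 bytes of encoded data.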
One area where memory efficiency
can be preserved by developers is in
how codecs and data structures are
instantiated at the application level.
Consider a surveillance application
where cameras are left on for months
at a time. With any long-running application, fragmentation of heap
memory can become an issue when
algorithms are dynamically instantiated. Certain data structures offer
better performance when allocated in
the same memory page, so it becomes
important to manage heap memory
carefully. Video applications require a
wide variety of large buffers, such as
Group of Pictures, I-B-P frames, overlays, etc. For the best algorithm efficiency, these buffers need to be contiguous. Thus static functions should
be declared first so that they are allocated at the top of heap memory. If
dynamic functions are allocated first,
heap memory will be separated into
multiple smaller pools that cannot be
recombined.
One viable approach is to allocate
buffers using the same base block size.
Fragmentation is avoided because
buffers are equal in size. While some
memory may be allocated that is not
used in a particular buffer, this is a
small price to pay to avoid the
extreme long-term difficulties eventually caused by fragmentation. Not
only can algorithms process data in a
continuous chain, loading of (multiple) buffers can be easily accelerated through DMA mechanisms.
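The equal-block-size idea can be sketched as a simple pool: one contiguous region is reserved up front and handed out in fixed-size blocks, so releasing and reallocating in any order never leaves unusable gaps. BlockPool and its methods are hypothetical names for this illustration, and the sketch omits safeguards such as double-release detection.

```python
class BlockPool:
    """Fixed-size block allocator over one contiguous region (illustrative)."""

    def __init__(self, block_size, block_count):
        self.block_size = block_size
        # One contiguous allocation made once, at start-up.
        self.memory = bytearray(block_size * block_count)
        # Reverse order so that pop() hands out the lowest index first.
        self.free_blocks = list(range(block_count - 1, -1, -1))

    def allocate(self):
        """Return (block index, writable view of that block)."""
        if not self.free_blocks:
            raise MemoryError("pool exhausted")
        index = self.free_blocks.pop()
        start = index * self.block_size
        view = memoryview(self.memory)[start:start + self.block_size]
        return index, view

    def release(self, index):
        """Return a block to the pool; any freed block fits any request."""
        self.free_blocks.append(index)


pool = BlockPool(block_size=4096, block_count=3)
first, first_view = pool.allocate()
second, second_view = pool.allocate()
pool.release(first)
reused, reused_view = pool.allocate()
```

Since every block has the same size, the block released by one buffer is always a perfect fit for the next request, which is why this scheme does not fragment over months of operation.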
DMA is, without question, one of
the essential elements of achieving
optimal performance. Programmable
DMA engines enable processors to
move large blocks of data on and off
chip directly into or out of codec data
structures in the background without
requiring direct interaction from the
main processor.
The challenge that arises is that
DMA requests within codecs are independent from DMA transfers initiated
by the application. Without coordination, it is inevitable that DMA
requests will interfere with each other, hindering the efficient movement of data.
Additionally, DMA bypasses the
need for temporarily storing large
amounts of data when passing or
receiving video buffers across codec
APIs; passing buffers as parameters
often requires that an extra copy be
made to prevent unintended overwriting of data. Such copying quickly
erodes performance and memory
bandwidth. It may be tempting for
developers to break codec APIs to
handle DMA transfers directly, but the
fact that different codecs handle data
differently locks the specific implementation within the application and
destroys the interchangeability of
codecs.
Rather than bypassing the API
when moving data, codecs and application code can cooperate through the
use of a single DMA interface which
efficiently manages allocation of
DMA resources. When both codec
and application adhere to the interface, DMA efficiency is automatically
maximized for both codec and application code without manual tuning.
Interchangeability is thus preserved
while at the same time eliminating the
extra reads and writes associated
with passing buffers as parameters.
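One way to picture a single DMA interface is as an arbiter that owns every channel, so that codec and application requests can never collide. The sketch below is a simplified model only: DmaManager, its methods, and the two-channel configuration are assumptions for illustration and do not represent any actual TI interface.

```python
class DmaManager:
    """Single point of allocation for DMA channels (illustrative model)."""

    def __init__(self, channel_count):
        # owners[i] is None when channel i is free.
        self.owners = [None] * channel_count

    def acquire(self, owner):
        """Return a free channel number, or None if all are busy."""
        for channel, current in enumerate(self.owners):
            if current is None:
                self.owners[channel] = owner
                return channel
        return None

    def release(self, channel, owner):
        """Free a channel; only the component that holds it may do so."""
        if self.owners[channel] != owner:
            raise ValueError("channel is not held by this owner")
        self.owners[channel] = None


dma = DmaManager(channel_count=2)
codec_channel = dma.acquire("codec")
app_channel = dma.acquire("application")
denied = dma.acquire("application")      # no channel left, caller must wait
dma.release(codec_channel, "codec")
retry_channel = dma.acquire("application")
```

Because both sides go through the same manager, neither needs to know how the other uses DMA, and a different codec can be dropped in without touching the application's transfer code.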
DATA STRUCTURE ACCESS WITHOUT BREAKING APIS
To optimize overall performance
without affecting interchangeability
also requires application-level access
to key video data structures. If a
codec is completely hidden behind the
abstraction of an API, this can result
in significant loss of video quality or
performance.
Consider that when the frame rate is
dropped, the application must ensure that the highest quality frames are not
the ones dropped. For example,
MPEG-4 uses I-B-P frames, with I-frames capturing the most detail. Ideally, the application should drop B- or
P-frames before it ever drops an I-frame. However, it can only do this if
the codec tags I-frames so that they can
be differentiated from B- and P-frames.
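Once frames carry a type tag, the drop policy itself is short. The sketch below assumes the codec exposes each frame's type as the string "I", "P", or "B"; that tagging scheme and the function name are hypothetical. Within a type it drops later frames first, a design choice made here because frames that follow a dropped P-frame and reference it can no longer be decoded correctly.

```python
def select_frames_to_drop(frame_types, drop_count):
    """Return sorted indices of frames to drop: B first, then P, I last.

    frame_types is a list such as ["I", "B", "B", "P"], one tag per frame
    (an assumed tagging scheme for this example).
    """
    priority = {"B": 0, "P": 1, "I": 2}
    # Order candidates by type priority, then latest frame first.
    candidates = sorted(
        range(len(frame_types)),
        key=lambda i: (priority[frame_types[i]], -i),
    )
    return sorted(candidates[:drop_count])


gop = ["I", "B", "B", "P", "B", "B", "P"]
drop_three = select_frames_to_drop(gop, 3)
drop_five = select_frames_to_drop(gop, 5)
```

Dropping three frames from this group removes only B-frames; a P-frame is touched only once every B-frame is gone, and the I-frame is the last candidate of all.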
The same applies to transcoding,
where video is decoded from one format and encoded into another. If only
the resultant decoded video stream is
used, losses in the decoding process
will propagate and further degrade
video quality. When the transcoder
has direct access to the motion vectors used to create the decoded frames,
however, quality can be preserved.
Application access into codec data
structures is even more important for
the support of video analytics. An
object recognition component that can
utilize processing already completed
by the codec (rather than having to
duplicate such processing itself) preserves processing resources. Likewise, an event-triggering component
must be able to signal the
codec to increase the target bit rate as
quickly as possible and to use JPEG compression for pre/post-alert snapshots.
Video analytics is a fast-moving
field, with no standards and continuous innovation. APIs that don’t provide hooks past the API abstraction
force developers to break the API,
tying the video analytic implementation to the particular codec implementation, limiting reuse of the video analytic as well as destroying
interchangeability.
Interchangeability is a critical foundation of today’s digital video camera
applications, and platforms, such as
TI’s DaVinci technology, are configured to ensure this is possible with
minimal retooling. The ability to
dynamically adjust a video stream’s bit
rate and quality based on resolution,
frame rate, and codec format enables
developers to maximize bandwidth
utilization. Keeping codec implementation transparent to the application through the use of APIs preserves interchangeability. Optimization, then, becomes a system-level
process. Rather than focusing on
hand-optimization of codecs, which
requires too much development
investment with too little gain, developers instead optimize the interaction
between codecs and application code.
In this way, APIs can be preserved
while enabling codec interchangeability and promoting application code
reuse.
Dr. Cheng Peng is a DSP Video Systems Engineer at Texas Instruments. He can be reached at c-peng2@ti.com