J Sign Process Syst
DOI 10.1007/s11265-013-0831-6
Multimedia Communications Using a Fast and Flexible DVC to H.264/AVC/SVC Transcoder
Alberto Corrales-García • Rafael Rodríguez-Sánchez • José Luis Martínez • Gerardo Fernández-Escribano • Francisco José Quiles
Received: 7 November 2011 / Revised: 2 July 2013 / Accepted: 9 July 2013
© Springer Science+Business Media New York 2013
Abstract The evolution of network technologies and mobile devices (equipped with low-cost video cameras) offers new multimedia services for mobile telephony, such as video communications. However, this kind of multimedia service must meet special requirements in terms of low complexity on both sides of the communication. Currently, most mobile video communications are based on traditional codecs, which concentrate high complexity on the encoder side. Distributed Video Coding, in contrast, tackles the problem of tight complexity constraints for encoding algorithms at the expense of increased decoder complexity. Taking into account the benefits of both paradigms, Distributed Video Coding to H.264 transcoding provides such multimedia systems with low complexity on both sides. Moreover, there is an H.264 extension called Scalable Video Coding which supports a variety of networks and devices. The proposed scheme moves the highly complex processes to the transcoder, which has more resources. However, to achieve low-delay transmission between mobile devices, the transcoding time must be reduced. For this purpose, this paper focuses on reducing the complexity of the transcoder. To start with, the first transcoding stage is improved by means of a multicore processor, which executes the decoding algorithm in parallel. Then, the second stage uses the motion vectors generated during the first decoding stage to reduce the motion estimation complexity of the H.264 encoder and of its scalable extension as well. To support different Distributed Video Coding/H.264 patterns and profiles, the proposed transcoder includes a mapping between different kinds of frames and GOP lengths from both paradigms. As a result, this paper proposes an efficient algorithm to support mobile-to-mobile video communications which reduces the transcoding time by about 70 % without significant rate-distortion penalty.

A. Corrales-García (*) · R. Rodríguez-Sánchez · J. L. Martínez · G. Fernández-Escribano · F. J. Quiles
Albacete Research Institute of Informatics, University of Castilla-La Mancha, Campus Universitario s/n, 02071 Albacete, Spain
e-mail: albertocorrales@dsi.uclm.es
R. Rodríguez-Sánchez
e-mail: rrsanchez@dsi.uclm.es
J. L. Martínez
e-mail: joseluismm@dsi.uclm.es
G. Fernández-Escribano
e-mail: gerardo@dsi.uclm.es
F. J. Quiles
e-mail: paco@dsi.uclm.es

Published online: 14 August 2013
Keywords Transcoding · H.264 · SVC · DVC · Wyner-Ziv · Parallel Computing
1 Introduction
Nowadays, mobile devices demand multimedia services such as video communications due to the advances in mobile communication systems and the integration of video cameras in mobile devices. However, these devices have limitations regarding computing power, resources and complexity constraints for running complex algorithms. For this reason, in order to establish video communications between mobile devices it is necessary to use low-complexity encoding techniques. In traditional video codecs such as H.264/MPEG-4 Part 10 Advanced Video Coding (AVC) [1], most computation is performed on the encoder side. Furthermore, Scalable Video Coding (SVC) [2] is an extension of the H.264 standard. H.264/SVC streams are composed of layers which can be removed to adapt the streams to the requirements or capabilities of the end-user devices or the network conditions [3]. H.264/SVC is also based on highly complex encoding algorithms. To reduce this complexity, traditional solutions employ low-complexity tools. As a
consequence, mobile video communications with low-complexity encoding algorithms based on traditional standards imply an aggressive loss of Rate-Distortion (RD) performance. On the contrary, Distributed Video Coding (DVC) [4] provides a novel video paradigm where encoder complexity is reduced by shifting it to the decoder [5]. At this point, DVC to H.264/AVC/SVC transcoders offer a solution for these scalable mobile-to-mobile video communications. In this framework, as shown in Fig. 1, the H.264/AVC and H.264/SVC decoders and the DVC encoder are allocated to the resource-constrained user devices, while the H.264/AVC/SVC encoder and the DVC decoder are moved to the network, where the transcoder is located. Nevertheless, there are many differences between the DVC decoding and H.264/AVC/SVC encoding algorithms, and a transcoder has to deal with several resulting problems. Firstly, the Group of Pictures (GOP) structure in the two paradigms is different: while GOPs in H.264 normally have a length of 12 (or even 15), the DVC GOP length is 2, 4 or 8. H.264/AVC/SVC basically defines I, P and B frames, whereas in DVC the frames can be Key (K) or Wyner-Ziv (WZ). Moreover, both kinds of GOPs have different patterns. Although a K frame is an approximation of an I frame, WZ frames have different meanings from P and B frames. Conversion between different GOPs is a desired feature in a transcoding framework because different patterns change the amount of data generated in the bitstream, depending on the network requisites. In this respect, this paper is a further development in the framework of DVC to H.264 video transcoders and offers a GOP mapping solution from K and WZ frames to whatever kind of frame is required in H.264/AVC/SVC. Moreover, some refinement of the H.264/AVC/SVC Motion Estimation (ME) algorithm is introduced. This refinement is adjusted
depending on the incoming DVC frame and the outgoing H.264/AVC/SVC frame. The mismatches between GOP lengths are also solved in the proposed transcoder.
Since DVC was proposed, the research community has realized its advantages for low-complexity video encoding. For this reason, several transcoding approaches to support mobile-to-mobile communications have been proposed (as will be mentioned in section 3.2). However, they have focused on reducing the complexity of the traditional video encoding stage by means of DVC information, while DVC decoder complexity was not taken into account. This means DVC decoding delays the transmission between mobile devices. Most of the DVC decoding time is spent in iterative turbo decoding through the feedback channel. With regard to transcoding, commercial transcoders are computers with many resources, including high-computation devices such as multicore processors (e.g. DVEO commercial transcoders [6]). Taking this fact into account, the proposed architecture includes a method to execute DVC decoding in parallel, thus achieving faster DVC decoding: firstly, a GOP-level parallelism, where each GOP decoding process is carried out on a different thread; and secondly, a parallel decoding method which spatially splits each frame into several slices and distributes them among the different computing units of a multicore processor. This provides high flexibility because the parallel decoding method can be applied to different architectures (with or without a feedback channel), GOPs and domains (Transform Domain or Pixel Domain). In addition, it is not dependent on particular hardware, so it could be used by future multicore processors with more cores.
To sum up, in this paper we propose a more realistic architecture, where both higher complexity parts (DVC decoding and H.264 encoding and its SVC extension) are
improved in order to reduce the delay for mobile-to-mobile video communications.
This paper is organized as follows: Section 2 presents an overview of the DVC decoder and H.264/AVC/SVC encoder architectures; Section 3 identifies the state-of-the-art transcoders based on DVC; Section 4 describes the proposed video transcoder, which is evaluated in Section 5 with some simulation results; finally, in Section 6, conclusions are presented.
2 Technical Background
2.1 Wyner-Ziv Video Coding
Wyner-Ziv video coding is a particular case of DVC which deals with lossy source coding with side information at the decoder. The first Wyner-Ziv architecture was proposed by Stanford in [4, 7], and this work was widely referenced and improved on in later proposals [8, 9]. As a result, in [10] an architecture called DISCOVER was proposed, and this outperforms the previous Stanford one. This architecture provided a reference for the research community and finally it was improved upon in the form of the VISNET-II architecture [11], which is depicted in Fig. 2. The Wyner-Ziv architecture adopted in this paper is based on the one proposed by the VISNET-II team. More details about this architecture can be found in [8-11].
Although the main aim of DVC is to provide low-complexity encoders, the high complexity of the decoders makes it difficult to support applications with delay requirements (such as video transcoding). Fig. 3 analyzes how the complexity is distributed among the main DVC decoding modules. The Foreman sequence at 15 fps and QCIF resolution was decoded (150 frames) with a GOP of length 2 and 1-4 bitplanes. The vertical axis represents the time spent (in seconds), the horizontal axis indicates the number of bitplanes used to encode the sequence, and the z-axis represents each module. As can be observed, the Turbo Decoder module uses up most of the decoding time, increasing considerably when more bitplanes are decoded. The times displayed in Fig. 3 were measured over the whole Foreman sequence.
2.2 H.264/AVC
The main purpose of H.264 is to offer a good-quality standard able to considerably reduce the output bit rate of the encoded sequences, compared with previous standards, while substantially increasing image quality. H.264 represents a significant advance over the commercial standards currently most in use (MPEG-2 and MPEG-4) [12]. For this reason H.264 contains a large number of compression techniques and innovations compared to previous standards; it allows more compressed video sequences to be obtained and provides greater flexibility for implementing the encoder. Figure 4 shows the block diagram of the H.264 encoder.
2.3 H.264/SVC
As said previously, SVC [2] is an extension of H.264/AVC. SVC streams are composed of layers which can be removed to adapt the streams to the needs of end users, the capabilities of the terminals or the network conditions. The layers are divided into one base layer and one or more enhancement layers, which reuse data from lower layers for efficient coding.
SVC supports three main types of scalability: 1) Temporal Scalability: the base layer is coded at a low frame rate; by adding enhancement layers the frame rate of the decoded sequence can be increased. 2) Spatial Scalability: the base layer is coded at a low spatial resolution; by adding enhancement layers the resolution of the decoded sequence can be increased. And, 3) Quality (SNR) Scalability: the base layer is coded at a low quality; by adding enhancement layers the quality of the decoded sequences can be increased. Our proposal provides temporal scalability. Figure 5 shows the block diagram of the H.264/SVC encoder.
Figure 2 DVC architecture scheme (the input is split into conventional and WZ frames; the Wyner-Ziv encoder quantizes, orders the bits and channel-encodes the WZ frames, while the Wyner-Ziv decoder combines a channel decoder, a feedback channel, a correlation noise model, side information extraction, a buffer and reconstruction; K frames follow a conventional video encoder/decoder path).
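As an illustration of temporal scalability, the following sketch assigns frames to the layers of a dyadic hierarchical prediction structure. The helper name and the dyadic assumption are ours, not taken from the SVC reference software:

```python
# Sketch: assigning frames to dyadic temporal layers (hypothetical helper;
# illustrates temporal scalability only, not the paper's codec).

def temporal_layer(frame_idx, gop_size):
    """Return the temporal layer of a frame in a dyadic hierarchy.

    Layer 0 holds the base frame rate (one frame per GOP); each extra
    layer doubles the frame rate of the decoded sequence.
    """
    pos = frame_idx % gop_size
    if pos == 0:
        return 0                               # base-layer (key) frame
    max_layer = gop_size.bit_length() - 1      # log2(gop_size)
    # Trailing zeros of the position decide how "coarse" the frame is.
    tz = (pos & -pos).bit_length() - 1
    return max_layer - tz

# A GOP of 4 gives layers [0, 2, 1, 2, ...]: dropping layer 2 halves
# the frame rate, dropping layers 1-2 quarters it.
layers = [temporal_layer(i, 4) for i in range(8)]
```

Removing the highest layer thus degrades the frame rate gracefully, which is the adaptation mechanism the text describes.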
3 Related Work
3.1 Reduced DVC Decoder Complexity
The DVC framework is based on displacing the complexity from encoders to decoders. However, reducing the complexity of decoders as much as possible is desirable. In traditional feedback-based DVC architectures [4], the rate control is performed at the decoder and is controlled by means of the feedback channel; this is the main reason for the decoder complexity, as once a parity chunk arrives at the decoder, the turbo decoding algorithm (one of the most computationally-demanding tasks [5]) is called. Taking this fact into account, there are several approaches which try to reduce the complexity of the decoder, which usually induces a rate distortion penalty. However, due to technological advances, new parallel hardware is beginning to be introduced into practical video coding solutions. These new features of computers offer a new challenge to the research community with
regards to integrating their algorithms into a parallel framework; this opens a new door in multimedia research. It is true that, with regards to traditional standards, several approaches have been proposed since multicores appeared on the market, but this paper focuses on parallel computing applied to the DVC framework.
Having said this, in 2010 several different parallel solutions for DVC were proposed. In particular, in [13] Ryanggeun et al. proposed a DVC parallel execution carried out by Graphics Processing Units (GPUs). On the other hand, in [14] Momcilovic et al. proposed a DVC LDPC parallel decoding based on multicore processors. Both approaches propose low-level parallelism for a particular LDPC/LDPCA implementation. Other solutions available in the literature try to reduce the decoder complexity of DVC by means of soft-computing techniques [15] or even by totally or partially eliminating the feedback channel [16, 17].
3.2 DVC to H.26x Transcoding
Transcoding from a low-cost encoder format to a low-cost decoder format provides a practical solution for these types of communications. Although H.264 has been included in multiple transcoding architectures from other coding formats
Figure 4 H.264 encoder architecture scheme (prediction, transform and quantization followed by reordering and entropy encoding, with inverse transform and a deblocking filter on the reconstruction path).
Figure 5 H.264/SVC encoder architecture scheme (each layer performs motion-compensated and intra prediction, residual coding, motion prediction and entropy coding, with inter-layer intra, residual and motion prediction between layers; layer 0 produces an H.264/AVC-compatible base layer bitstream, and all layers are multiplexed into the scalable bitstream).
(such as MPEG-2 to H.264 [18, 19] or even H.264 to H.264/SVC [20]), proposals in DVC to H.26x to support mobile communications are rather recent and there are only a few approaches so far.
In 2008, the first DVC transcoder architecture was introduced by Peixoto et al. in [21]. In this work, they presented a DVC to H.263 transcoder for mobile video communications. However, H.263 offers lower performance than codecs based on H.264, such as the one proposed in this paper.
In our previous work, we proposed the first transcoding architecture from DVC to H.264 [22]. This work introduced an improvement to accelerate the H.264 MacroBlock (MB) mode coding decision by means of machine learning techniques. Nevertheless, this transcoder is not flexible, since it only applies the ME improvement for transcoding from WZ frames to P frames. In addition, it only allows transcoding from DVC GOPs of length 2 to IPIP H.264 GOP patterns, so it does not use practical patterns, due to the high bit rate generated. Furthermore, that work [22] only focuses on the second part of the transcoder, the H.264 encoding part, and does not address the DVC decoder complexity. Other previous work [23-25] focuses on reducing the complexity of the DVC decoder (the first part of the transcoder) based on parallel approaches: bitplanes [23], spatial portions [24] and GOPs [25]. At this point, this paper proposes a full transcoder which addresses both parts of the transcoding algorithm. On the one hand, we introduce a hybrid parallel algorithm based on GOPs and spatial portions. In addition,
the proposed algorithm is scalable because it does not depend on the hardware architecture, the number of cores or even the implementation of the internal Wyner-Ziv decoder. Therefore, the time reduction can be increased simply by increasing the number of cores as technology advances. Furthermore, the proposed method can also be applied to DVC architectures with or without a feedback channel [26]. On the other hand, for the second part of the transcoding algorithm, we propose a motion search area reduction. Moreover, the proposal is adapted for the scalable extension H.264/SVC. The present approach reports results for the complete transcoding algorithm and for each part separately as well. Results in terms of CPU usage and comparisons with the state of the art will also be shown. In the framework of DVC to H.264/SVC, as far as the authors of this paper know, this is the only work available in the literature. However, other transcoding techniques involving SVC can be found in [27-29].
4 Proposed Transcoder
The main task of a transcoder is to convert a source coding format into another one. In the case of mobile video communications, the transcoding process should be done as fast as possible. In addition, a flexible transcoder should take into account the conversion between the input and the output patterns. In order to provide a flexible and fast transcoding
architecture, in this paper we propose the architecture displayed in Fig. 6. It is composed of a Wyner-Ziv decoder and a H.264/AVC/SVC encoder with several modifications or extra modules. In Fig. 6, black modules have been included or modified. Details will be given in the following subsections.
4.1 Parallel Wyner-Ziv Decoding
Nowadays, most commercial computers include several processing units in one chip or multicore processor. However, few applications take advantage of the computational capacity offered by multiple processing units because their
executions are sequential. In this way, one processing unit is overloaded whereas the rest are wasted. Therefore, it is the task of the researcher to develop and design algorithms which can translate applications from sequential to parallel execution and thus exploit the higher performance of the processors.
As Fig. 3 showed, most of the complexity of the DVC decoding algorithm is concentrated in the turbo decoder module. In order to reduce this time, our architecture proposes a DVC parallel decoding architecture based on multicore processors. As Fig. 6 shows, the proposed architecture is composed of several Wyner-Ziv partial decoders which work in parallel. From previous parallel WZ decoding
Figure 6 Proposed DVC to H.264/AVC/SVC architecture.
schemes presented in [23-25], we can draw several conclusions. First, a parallel distribution based on BPs [23] is quite limited, since the level of parallelism depends directly on the number of BPs encoded, and the rate distortion impact is quite high due to the CNM dependencies between BPs. Second, distributing the data at frame level [24], generating one SI per portion, would be a good solution; however, decoding small partitions could have a slight impact on turbo code performance. Finally, GOP-level parallelism [25] performs well in terms of rate distortion because it does not introduce any rate distortion penalty. At this point, a flexible and scalable architecture is proposed, which distributes the burden of complexity over two levels, namely GOP and frame levels, in order to obtain the advantages of both: a negligible rate distortion penalty from the GOP approach, and the low delay introduced by the spatial distribution approach.
The parallel decoding works as follows for each frame:
(1) The input bitstream composed of K frames is stored in a K-frame buffer. Then, within the first parallelism level, the WZ frames inside two K frames delimit a GOP structure, and therefore a different GOP decoding process is carried out in a different core.
(2) From K frames, SI is calculated for the whole frame. That is, SI generation is not parallelized; this is managed in this way because the accuracy of the SI has a significant impact on turbo decoder performance. If a different SI were calculated for each partition, the quality of the SI would drop and more iterations (bitrate) would be needed to achieve the same final quality. In addition, in this step, the MVs generated during the SI process are stored in an MV buffer in order to use them during the H.264 encoding stage, as will be explained in section 4.2.
(3) In addition, each frame decoding inside a GOP is spatially-decoded in parallel. For each WZ frame, a SI is calculated and divided into several portions, which are distributed among the available cores. The WZ frame itself (that which is turbo-decoded) has been spatially divided at the encoder and the parity bit information should be divided within the partition. This parity information and its corresponding SI (it is also split into the same number of parts) are assigned to a partial decoder by a scheduler. This scheduler works in a dynamic way, so that whenever a core finishes a task, the scheduler assigns another one (if there are any pending tasks).
(4) Each partial decoder is executed by a different core and the decoding process works in parallel. In addition, each partial decoder includes one partial CNM to obtain an approximation of the residual distribution between the whole SI part and the original frame part, which has
been adapted to work in parallel. For each part, several BPs are decoded depending on the quantification applied.
(5) When the decoding of every frame part finishes, all parts are joined and the frame is reconstructed as in the sequential version.
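The five steps above can be sketched as follows. This is a minimal illustration that assumes a thread pool acts as the scheduler; the decoder internals (SI generation, turbo decoding, CNM) are stubbed placeholders, and all function names are hypothetical:

```python
# Sketch of the two-level parallel WZ decoding described in steps 1-5.
from concurrent.futures import ThreadPoolExecutor

def generate_si(key_before, key_after):
    # Placeholder for MCTI side information, computed for the WHOLE frame
    # (step 2: SI generation is deliberately not parallelized).
    return [(key_before[i] + key_after[i]) // 2 for i in range(len(key_before))]

def decode_slice(si_slice, parity_slice):
    # Placeholder for the turbo decoding of one spatial portion (step 4).
    return si_slice  # pretend the SI was already correct

def decode_wz_frame(key_before, key_after, parity, n_slices, pool):
    si = generate_si(key_before, key_after)
    step = len(si) // n_slices
    si_parts = [si[i * step:(i + 1) * step] for i in range(n_slices)]
    parity_parts = [parity[i * step:(i + 1) * step] for i in range(n_slices)]
    # Step 3-4: the scheduler hands each portion to a partial decoder.
    parts = list(pool.map(decode_slice, si_parts, parity_parts))
    return [px for part in parts for px in part]   # step 5: rejoin the parts

with ThreadPoolExecutor(max_workers=8) as pool:   # 8 partial decoders
    k0, k2 = [10] * 12, [20] * 12                 # toy "key frames"
    wz1 = decode_wz_frame(k0, k2, [0] * 12, n_slices=3, pool=pool)
```

GOP-level parallelism (step 1) would simply submit several `decode_wz_frame` calls for different GOPs to the same pool.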
4.2 Proposal for Transcoding and Mapping GOPs
In order to provide fast and flexible transcoding at the H.264 encoder side, we have to study two issues: firstly, how MVs generated during the SI process could help to reduce the time used in ME (both H.264 and its SVC extension); secondly, taking into account that DVC and H.264 can build different GOPs, how to map MVs between different GOP combinations in order to provide flexibility.
4.2.1 Reducing Motion Estimation Complexity
As explained in [5], an important task within the DVC decoding process is SI generation, which is the first step in generating the WZ frames from K frames. VISNET-II performs Motion Compensated Temporal Interpolation (MCTI) to estimate the SI. The first step of this method is shown in Fig. 7 and consists of matching each forward-frame MB with a backward-frame MB inside the search area. The process checks all the possibilities inside the search area and chooses the MV that generates the lowest residual. The midpoint of this MV represents the displacement of the interpolated MB (more details about the SI generation process can be found in [30]).
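A minimal sketch of this matching step follows, assuming an exhaustive SAD search over a small window; the helper names and the toy frame data are ours, not the VISNET-II implementation:

```python
# Sketch: exhaustive block matching, keeping the MV with the lowest
# residual (sum of absolute differences). Frames are 2-D lists of luma.
def sad_block(forward, backward, bx, by, dx, dy, mb):
    h, w = len(backward), len(backward[0])
    if not (0 <= bx + dx and bx + dx + mb <= w and
            0 <= by + dy and by + dy + mb <= h):
        return float("inf")  # candidate block falls outside the frame
    return sum(abs(forward[by + y][bx + x] - backward[by + dy + y][bx + dx + x])
               for y in range(mb) for x in range(mb))

def best_mv(forward, backward, bx, by, mb=3, search=2):
    candidates = [(dx, dy) for dy in range(-search, search + 1)
                  for dx in range(-search, search + 1)]
    return min(candidates,
               key=lambda v: sad_block(forward, backward, bx, by, v[0], v[1], mb))

# Toy 6x6 frames: 'backward' is 'forward' shifted one pixel to the right,
# so the block at (1, 1) should match with MV (+1, 0).
forward = [[(x * 7 + y * 13) % 50 for x in range(6)] for y in range(6)]
backward = [[row[0]] + row[:-1] for row in forward]
mv = best_mv(forward, backward, bx=1, by=1)
```

The midpoint of the winning vector would then give the displacement of the interpolated MB, as the text describes.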
Figure 7 First step of the SI generation process (a forward-frame MB is matched against backward-frame MBs within the search area; the match with the lowest residual defines the MV of the interpolated MB).
Obviously, MVs generated in the DVC decoding stage contain approximate information about the amount of movement in the frame. Following this idea, the present approach proposes to reuse the MVs to accelerate the H.264/AVC/SVC encoding stage by reducing the search area of the ME stage. Moreover, this reduction is adjusted for every input DVC GOP and every output H.264 GOP in an efficient and dynamic way. As shown in Fig. 8, the search area for each MB is defined by a circumference with a radius dependent on the incoming SI MV (Rmv). This search area can oscillate between a minimum (defined by Rmin) and a maximum (limited by the H.264 search area). In particular, the radius will vary depending on the type of frame and the distance of the reference frame, as will be explained in section 4.2.2. Furthermore, a minimum area is considered because MVs are calculated from 16x16 MBs in the SI process, whereas H.264 can work with partitions smaller than 16x16. Besides, SI is only an approximation of the frame, so some changes could occur when the frame is completely reconstructed. For these reasons, this minimum was set at 4. For the scalable version, the information gathered is the same, but the algorithm also includes the layer to which the frame belongs, because the ME procedure is called in both paradigms.
In a nutshell, the algorithm for reducing the motion estimation works as follows. As shown in Fig. 8, WZ decoding provides one backward and one forward MV for every 8x8 sub-partition in each MB. The backward ones are used for P-frame encoding in H.264, and B-frame encoding uses both MVs. Then, depending on the sub-partition to be checked by H.264, the final predicted MV is calculated as follows: if the sub-partition is bigger than 8x8 (16x16, 8x16, 16x8), the predicted MV is calculated by taking the average of the MVs included in the sub-partition. If the sub-partition is equal to or smaller than 8x8, the corresponding MV is applied directly. Then, for each MB and sub-MB partition, one MV for P frames (previous reference) and two MVs for B frames (previous and future references) are considered. Each of these MVs is composed of two components, MVx and MVy, which are multiplied by a factor depending on the distance of their reference frames. In this way, the resulting dynamic search area is defined by a circumference with a radius that is dependent on the estimated MV (Rmv) and the minimum radius labeled as Rmin.
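The selection and clamping just described can be sketched as follows; the function names, the example vectors and the full search range of 16 are assumptions for illustration:

```python
# Sketch of the dynamic search-area reduction: average the SI vectors a
# partition covers, then clamp the radius between Rmin (= 4, per the
# text) and the encoder's full search range.
import math

R_MIN = 4  # minimum radius stated in the text

def predicted_mv(covered_mvs):
    """Average the (mvx, mvy) SI vectors of the 8x8 sub-blocks that the
    current H.264 partition covers; a single vector is returned as-is."""
    xs = [mv[0] for mv in covered_mvs]
    ys = [mv[1] for mv in covered_mvs]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def search_radius(mv, distance_factor, full_range=16):
    """Radius of the circular search area: follows the scaled SI vector,
    clamped between R_MIN and the encoder's full search range."""
    r = math.hypot(mv[0] * distance_factor, mv[1] * distance_factor)
    return min(max(math.ceil(r), R_MIN), full_range)

# A 16x16 partition covering four 8x8 sub-block vectors:
mv = predicted_mv([(2, 4), (4, 8), (6, 0), (4, 4)])   # averages to (4.0, 4.0)
r = search_radius(mv, 1.0)                            # hypot(4, 4) -> radius 6
```

Small vectors thus collapse the search to the minimum area, while large ones fall back to the full H.264 range.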
4.2.2 Mapping GOPs
One desired feature of every transcoder is flexibility, and an important factor in achieving it is proper GOP mapping. This approach proposes a DVC to H.264 transcoder which supports every mapping combination while reusing information to save time. To extract MVs, first the distance used to calculate the SI is considered. For example, Fig. 9 shows the transcoding process for a DVC GOP of length 4 to a H.264 IPPP pattern (baseline profile). In step 1, DVC starts to decode the frame labeled as WZ2, and the MVs generated in its SI generation are discarded because they are not closely correlated with the actual movement (low accuracy). When the WZ2 frame is reconstructed (through the entire DVC decoding algorithm, WZ'2) in step 2, the DVC decoding algorithm starts to decode frames WZ1 and WZ3 by using the reconstructed frame WZ'2. At this point, the MVs V0-2 and V2-4 generated in this second iteration of the DVC decoding algorithm are stored. These MVs will be used to reduce the H.264 ME process. Notice that in the case of larger GOP sizes the procedure is the same. In other words, MVs are stored and reused when the distance between the SI and the two reference frames is 1.
Figure 8 Search area reduction for the H.264 encoding stage (the WZ MVs of the four 8x8 sub-blocks of each 16x16 MB, together with the H.264/AVC frame-prediction type (B or P), the reference-frame distance and the current partition size, define a variable search area bounded by the minimum search area and the full H.264/AVC search area).
Figure 9 Mapping from DVC GOP of length 4 to H.264 GOP IPPP (DVC step 1 decodes WZ2 using V0-4; DVC step 2 decodes WZ1 and WZ3, generating V0-2 and V2-4, which are halved for the H.264 P frames).
Finally, V0-2 and V2-4 are divided into two halves because P frames have their reference frame at a distance of one, whereas the MVs were calculated for a distance of two during the SI process.
For more complex patterns, which include mixed P and B frames (main profile), this method can be extended in a similar way with some changes. Figure 10 shows the transcoding from a DVC GOP of length 4 to a H.264 pattern IBBP. MVs are also stored by always following the same procedure. However, in this case the way to apply them in H.264 changes.
For P frames, MVs are multiplied by a factor of 1.5 because MVs were calculated in DVC for a distance of 2 and P frames in H.264/AVC have their references at a distance of 3. For B frames, the factor depends on the position at which they are allocated, and it differs for backward and forward searches.
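The distance-based scaling can be summarized in a small sketch; the function name is hypothetical, and the factors follow the cases in the text (halving for IPPP, 1.5 for the P frames of IBBP):

```python
# Sketch of the distance-based MV scaling used in the GOP mapping:
# SI vectors were measured over a distance of 2 frames and are rescaled
# to the reference distance of the target H.264 frame.
SI_DISTANCE = 2  # frame distance over which V0-2 / V2-4 were estimated

def map_mv(mv, target_distance):
    factor = target_distance / SI_DISTANCE
    return (mv[0] * factor, mv[1] * factor)

half = map_mv((8, -4), 1)    # IPPP: distance-1 references, vectors halved
p_ibbp = map_mv((8, -4), 3)  # IBBP P frame: distance 3, factor 1.5
```

The same scaling, with the sign flipped where the prediction direction reverses, covers the B-frame cases described above.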
As can be observed, this procedure can be applied to both K and WZ frames. Therefore, following this method the proposed transcoder can be used for transcoding from every DVC GOP to every H.264 GOP.
For the SVC extension the procedure is quite similar. For each WZ frame there are two MVs, whereas P frames use only one prediction direction. In this way, MVs may be mapped for later use in reducing the ME. Figure 11 shows the transcoding from a WZ GOP of length 2 to a H.264/SVC IPPP GOP with two layers and a hierarchical prediction structure [3]. The first K frame is transcoded to an I frame without any conversion. On the other hand, for every WZ frame there are two MVs (forward and backward predicted). Then, the orientation of the backward MVs is changed and each MV is assigned to the H.264/SVC frames, which will be placed in different layers, keeping the position of the frame in the WZ sequence, as shown in Fig. 11. Once the MV is mapped, it may be used to adapt the dynamic search area, as explained above.
For B frames, which have two references, the mapping method changes. Figure 12 shows the transcoding from a WZ GOP 2 to a H.264/SVC GOP 4 with B and P frames. Similarly to the previous case, the first K frame is also transcoded into an I frame without any conversion. For the rest of the frames, both MVs from WZ frames are mapped considering the position of each WZ frame. For B frames, both MVs are considered, but the orientation is changed if necessary, such as in frame 2. For P frames, just the backward MV is mapped.
Figure 10 Mapping from DVC GOP of length 4 to H.264 GOP IBBP (DVC step 1 generates V0-4; DVC step 2 generates V0-2 and V2-4).
Figure 11 Mapping from DVC GOP of length 2 to H.264/SVC GOP IPPP pattern and 3 layers.
5 Performance Evaluation
5.1 Test Conditions
In order to evaluate the proposed transcoder, four representative sequences with different motion levels were considered at two resolutions, QCIF and CIF. These resolutions are adequate for mobile-to-mobile video communications. The sequences were coded at 15 fps (QCIF) and 30 fps (CIF), using 150 and 300 frames respectively. These sequences and resolutions have been adopted as references in related works such as [5, 10, 11, 17]. In the DVC to H.264 transcoder applied, the DVC stage was generated by the VISNET II codec [11] using PD with BP=3 as quantification, as a trade-off between RD performance and complexity constraints, although any BP could be used. In addition, sequences were encoded in DVC with GOPs of length 2, 4 and 8 to evaluate different patterns. The parallel decoder was implemented using the Intel C++ compiler (version 11.1), which combines a high-performance compiler with the Intel Performance Libraries to support the creation of multi-threaded applications. In addition, it supports OpenMP 3.0 [31]. In order to test the performance of parallel decoding, it was executed on an Intel i7-940 multicore processor [32], although the proposal is not dependent on particular hardware.
Figure 12 Mapping from DVC GOP of length 2 to H.264/SVC GOP IBBBBP pattern and 3 layers (the vectors MV0-1, MV1-2, MV2-3 and MV3-4 are distributed among H.264/SVC layers 0-2).
For the experiments, WZ parallel decoding is performed by 8 partial decoders, since the Intel i7-940 processor is composed of 4 cores which can execute 8 threads (2 per core) at the same time. For the hybrid approach (described in Section 4.1), each frame is split into 3 divisions, so that each core processes a third of the frame. In other words, the spatial portions are 176x48 pixels for QCIF sequences and 352x96 pixels for CIF sequences (a third of a frame). Then, three consecutive GOPs are executed at the same time, and thus the K-frame buffer needs four K frames in order to be filled.
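The slice dimensions quoted above follow directly from splitting each frame into three horizontal stripes; a trivial sketch (helper name assumed):

```python
# Sketch: each frame is cut into horizontal stripes, one per partial decoder.
def slice_dims(width, height, n_slices=3):
    """Return (width, stripe_height), assuming the height divides evenly."""
    assert height % n_slices == 0
    return (width, height // n_slices)

qcif = slice_dims(176, 144)   # QCIF stripes, as stated in the text
cif = slice_dims(352, 288)    # CIF stripes
```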
During the decoding process, the MVs generated by the SI generation stage are sent to the H.264/AVC or H.264/SVC encoder; hence this does not involve any increase in complexity. In the second stage, the transcoder performs a mapping from every DVC GOP to every H.264/AVC/SVC GOP using QP=28, 32, 36 and 40. In our experiments we chose different H.264 patterns in order to analyze the behavior of the baseline profile (IPPP GOP) and the main profile (IBBP pattern). For H.264/SVC we tested the baseline profile (IPPP) and the main profile (IBBBP); both profiles work with hierarchical coding [3], using temporal scalability with 15 fps (2 layers) and 30 fps (3 layers). These patterns were transcoded by the reference and the proposed transcoder. The reference software used in the simulations was the H.264 JM reference software (version 17.0) [33] and the JSVM 9.9.14 reference software [34]. As stated in the abstract, the framework described is focused on communications between mobile devices; therefore, a low-complexity configuration must be employed. For this reason, we used the default configuration of the H.264 and H.264/SVC main and baseline profiles, only turning off RD optimization. The reference transcoder is composed of the whole DVC decoder followed by the whole H.264/AVC/SVC encoder. In order to analyze the performance of the proposed transcoder in detail, we consider the two halves separately (Tables 1 and 2 for the DVC decoding part and Tables 3, 4, 5, 6 and 7 for the H.264/AVC/SVC part), and global results are also presented (Tables 8, 9, 10 and 11).
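To make the GOP mapping concrete, the sketch below assigns an H.264 slice type to each decoded DVC frame for a repeating target pattern. It is a simplified, hypothetical illustration (function and parameter names are ours), not the adaptive mapping algorithm itself:

```python
def map_gop_pattern(num_frames, pattern="IPPP", gop_len=4):
    """Assign an H.264 slice type to each decoded DVC frame.

    The first frame of each H.264 GOP is coded as I; the remaining
    frames cycle through the pattern tail (e.g. 'IBBP' -> B, B, P).
    """
    tail = pattern[1:]  # slice types used after the leading I frame
    types = []
    for i in range(num_frames):
        if i % gop_len == 0:
            types.append("I")
        else:
            types.append(tail[(i % gop_len - 1) % len(tail)])
    return types
```

For example, eight decoded frames mapped with the main-profile pattern become I, B, B, P, I, B, B, P.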
5.2 RD and Time Results
In the DVC decoding scenario, the metrics of interest are the two sides of the RD function: bitrate (BR) and quality (PSNR). To calculate the PSNR difference, the PSNR of each sequence was estimated before and after transcoding. Then the PSNR of the proposed transcoder was subtracted from that of the reference one for each H.264/AVC/SVC RD point, as defined by Eq. 1.
ΔPSNR (dB) = PSNRreference − PSNRproposed    (1)
Equation 2 was applied in order to calculate the bitrate increment (ΔBR) between the reference and proposed DVC decoders as a percentage; a positive increment means a higher bitrate is generated by the proposed transcoder.
ΔBitrate (%) = 100 × (BRproposed − BRreference) / BRreference    (2)
Concerning the time reduction (TR), it was estimated as a percentage using Eq. 3. In this case, a negative time reduction means decoding time saved by the proposed DVC decoder.
TimeReduction (%) = 100 × (Tproposed − Treference) / Treference    (3)
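Equations 1-3 can be computed directly; the helper functions below are a minimal sketch (the names are ours):

```python
def delta_psnr(psnr_reference, psnr_proposed):
    """Eq. 1: quality difference in dB (positive -> proposed loses quality)."""
    return psnr_reference - psnr_proposed

def delta_bitrate(br_reference, br_proposed):
    """Eq. 2: bitrate increment in percent (positive -> proposed needs more bits)."""
    return 100.0 * (br_proposed - br_reference) / br_reference

def time_reduction(t_reference, t_proposed):
    """Eq. 3: time reduction in percent (negative -> proposed is faster)."""
    return 100.0 * (t_proposed - t_reference) / t_reference
```

For instance, a reference decoding time of 100 s reduced to 26.27 s gives TR = -73.73 %, the average reported in Table 1.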
Table 1 Performance of the proposed DVC parallel decoder for 15 fps QCIF sequences (first stage of the proposed transcoder).

                     Reference DVC decoder    Proposed DVC parallel decoder
Sequence     GOP    PSNR (dB)   BR (kbps)       ΔBR (%)     TR (%)
Foreman       2       30.25      295.58           0.66      -75.99
              4       29.73      450.59          -0.05      -70.88
              8       28.96      571.31          -1.04      -76.98
Hall          2       32.81      222.24          -2.52      -75.14
              4       33.10      224.09           7.99      -70.47
              8       33.06      224.13           9.05      -69.34
CoastGuard    2       30.14      289.84           1.11      -72.80
              4       30.13      371.02           1.38      -74.46
              8       29.65      437.85           1.91      -74.73
Soccer        2       29.56      377.15          -2.67      -74.17
              4       29.05      593.66          -2.82      -75.81
              8       28.34      735.48          -3.20      -73.94
mean                                              0.82      -73.73
Table 2 Performance of the proposed DVC parallel decoder for 30 fps CIF sequences (first stage of the proposed transcoder).

                     Reference DVC decoder    Proposed DVC parallel decoder
Sequence     GOP    PSNR (dB)   BR (kbps)       ΔBR (%)     TR (%)
Foreman       2       33.30     1715.59          -1.18      -70.08
              4       32.54     2478.72          -0.59      -72.18
              8       31.29     3190.26          -0.01      -70.73
Hall          2       36.89     1378.83           4.04      -64.51
              4       36.61     1459.45           6.83      -66.81
              8       36.22     1516.75           0.09      -67.63
CoastGuard    2       33.54     2343.41           2.91      -68.20
              4       32.34     2762.09           1.52      -73.76
              8       30.91     3287.70           0.04      -74.29
Soccer        2       30.59     2412.65          -0.36      -71.19
              4       29.97     3897.47          -0.50      -74.25
              8       29.11     4935.22           0.00      -75.05
mean                                              1.07      -70.72
The performance of the proposed DVC parallel decoding is shown in Tables 1 (15 fps QCIF sequences) and 2 (30 fps CIF sequences). These tables do not include results for ΔPSNR because the quality obtained by DVC parallel decoding is the same as that of the reference decoding, since both iterate until a given threshold is reached [5].
As the results in Tables 1 and 2 show, when DVC decodes smaller and less complex parts, the turbo decoder (as part of the DVC decoder) sometimes converges in fewer iterations, which implies fewer parity bit requests and thus a bitrate reduction. However, generally speaking, the turbo codec yields better performance for longer inputs. For this reason, the bitrate difference is not always of the same sign. Comparing different GOP lengths, in short GOPs most of the bitrate is generated by the K frames. When the GOP length increases, the number of K frames is reduced, and the WZ frames then contribute to reducing the global bitrate in low-motion sequences (like Hall) or increasing it in high-motion
Table 3 Performance of the proposed mapping method for the H.264 baseline and main profiles for QCIF sequences (15 fps).

                    Baseline: IPPP H.264 pattern      Main: IBBP H.264 pattern
Sequence     GOP   ΔPSNR (dB)  ΔBR (%)   TR (%)     ΔPSNR (dB)  ΔBR (%)   TR (%)
Foreman       2      -0.02       0.57    -41.57       -0.01       0.01    -32.57
              4      -0.03       0.86    -44.62       -0.02       0.32    -33.57
              8      -0.04       1.11    -45.85       -0.01       0.06    -34.67
Hall          2      -0.01       0.41    -30.04       -0.01       0.24    -23.33
              4       0.00       0.12    -30.77       -0.01       0.25    -22.61
              8       0.00       0.17    -27.21       -0.01       0.19    -20.27
CoastGuard    2      -0.01       0.27    -47.46       -0.01       0.13    -46.85
              4      -0.01       0.33    -46.15        0.00       0.07    -45.83
              8      -0.01       0.20    -47.61        0.00       0.08    -45.49
Soccer        2      -0.01       0.19    -38.85        0.00       0.18    -31.43
              4      -0.04       1.18    -43.35       -0.02       0.73    -34.77
              8      -0.05       1.63    -44.98       -0.02       0.85    -36.87
mean                 -0.02       0.59    -40.71       -0.01       0.26    -34.02
Table 4 Performance of the proposed mapping method for the H.264 baseline and main profiles for CIF sequences (30 fps).

                    Baseline: IPPP H.264 pattern      Main: IBBP H.264 pattern
Sequence     GOP   ΔPSNR (dB)  ΔBR (%)   TR (%)     ΔPSNR (dB)  ΔBR (%)   TR (%)
Foreman 2 -0.01 0.01 -32.57 -0.09 1.56 -38.28
4 -0.02 0.32 -33.57 -0.07 1.35 -39.91
8 -0.01 0.06 -34.67 -0.06 1.33 -40.89
Hall 2 -0.01 0.24 -23.33 -0.01 0.24 -25.75
4 -0.01 0.25 -22.61 -0.01 0.13 -28.65
8 -0.01 0.19 -20.27 0.00 0.03 -27.16
CoastGuard 2 -0.01 0.13 -46.85 -0.02 0.51 -48.81
4 0.00 0.07 -45.83 -0.02 0.46 -48.12
8 0.00 0.08 -45.49 -0.02 0.41 -48.39
Soccer 2 0.00 0.18 -31.43 -0.05 1.98 -36.01
4 -0.02 0.73 -34.77 -0.06 2.36 -39.24
8 -0.02 0.85 -36.87 -0.06 2.38 -42.02
mean -0.01 0.26 -34.02 -0.04 1.06 -38.60
sequences (Foreman or Soccer). Generally, decoding smaller pieces of a frame (in parallel) works better for high-motion sequences, where the bitrate is similar or even lower in some cases.
Results for the second stage of the transcoder are shown in Tables 3, 4, 5, 6 and 7. In this case, both H.264/AVC/SVC encoders (reference and proposed) start from the same DVC output sequence (as DVC parallel decoding obtains the same quality as the reference DVC decoding), which is quantified with four QP values. For these four QP values, ΔPSNR and ΔBR are calculated as specified in Bjontegaard and Sullivan's common test rule [35]; TR is given by Eq. 3. In Table 3, DVC decoded sequences are mapped to an IPPP pattern. In this case the RD loss is negligible and the TR is around 40 %. For the IBBP pattern (right part of Table 3), the proposed method performs slightly better and the RD loss is even lower. For the CIF sequences, reported in Table 4, the conclusions are similar. Comparing both patterns, the IBBP pattern generates a slightly
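The Bjontegaard-Sullivan averages are commonly computed by fitting a cubic polynomial of PSNR over log10(bitrate) to each four-point RD curve and integrating over the overlapping bitrate range. The sketch below follows that common recipe; it is not the reference implementation:

```python
import numpy as np

def bd_psnr(rate_ref, psnr_ref, rate_test, psnr_test):
    """Average PSNR gap between two RD curves (Bjontegaard-style).

    Each curve is fitted with a cubic polynomial of PSNR over
    log10(bitrate); both fits are integrated over the overlapping
    bitrate interval and the difference is averaged.
    """
    lr_ref = np.log10(np.asarray(rate_ref, dtype=float))
    lr_test = np.log10(np.asarray(rate_test, dtype=float))
    p_ref = np.polyfit(lr_ref, np.asarray(psnr_ref, dtype=float), 3)
    p_test = np.polyfit(lr_test, np.asarray(psnr_test, dtype=float), 3)
    lo = max(lr_ref.min(), lr_test.min())
    hi = min(lr_ref.max(), lr_test.max())

    def integral(p):
        ip = np.polyint(p)
        return np.polyval(ip, hi) - np.polyval(ip, lo)

    # Positive result -> the test curve lies above (is better than) the reference.
    return (integral(p_test) - integral(p_ref)) / (hi - lo)
```

A curve shifted up by exactly 1 dB at the same bitrates yields a BD-PSNR of 1 dB, which is a convenient sanity check for any implementation.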
Table 5 Performance of the proposed mapping method for the H.264/SVC baseline and main profiles for QCIF sequences (15 fps) and 2 temporal layers.

                          Baseline: IPPP H.264/SVC pattern                        Main: IBBBP H.264/SVC pattern
                    1 Layer (7.5 fps)    2 Layers (15 fps)                  1 Layer (7.5 fps)    2 Layers (15 fps)
Sequence     GOP   ΔBR (%)  ΔPSNR (dB)  ΔBR (%)  ΔPSNR (dB)   TR (%)      ΔBR (%)  ΔPSNR (dB)  ΔBR (%)  ΔPSNR (dB)   TR (%)
Foreman       2      1.57     -0.06       2.03     -0.07       83.03        -0.25      0.01       0.39      0.01      -71.41
              4      2.17     -0.09       3.79     -0.13       85.15        -0.06      0.00       0.71      0.03      -73.08
              8      1.84     -0.07       3.44     -0.12       85.39        -0.23      0.01       0.43      0.02      -73.46
Hall          2      0.09      0.00       0.11      0.00       84.95        -0.07      0.00       0.19      0.01      -74.67
              4      0.04      0.00       0.05      0.00       85.87         0.06      0.00      -0.13      0.00      -74.70
              8      0.41     -0.02       0.30     -0.01       86.04        -0.01      0.00      -0.02      0.00      -74.74
CoastGuard    2      1.14     -0.06       1.17     -0.05       84.79         0.01      0.01      -0.07      0.00      -74.33
              4      1.45     -0.07       1.58     -0.06       85.80         0.09      0.00      -0.02      0.00      -74.52
              8      2.30     -0.09       2.10     -0.08       85.86        -0.22      0.01       0.28      0.01      -74.58
Soccer        2      1.05     -0.04       0.56     -0.03       81.38        -0.56      0.03       1.05      0.04      -68.76
              4      1.84     -0.08       4.35     -0.15       84.61        -0.47      0.02       1.56      0.06      -71.67
              8      2.90     -0.12       5.42     -0.18       84.89        -0.82      0.03       2.04      0.07      -72.21
mean                 1.39     -0.06       2.05     -0.07       84.81        -0.21      0.01       0.53      0.02      -73.18
Table 6 Performance of the proposed mapping method for H.264/SVC baseline profile for CIF sequences (30fps) and 3 temporal layers.
Baseline: IPPP H.264/SVC pattern
1 Layer (7.5 fps) 2 Layers (15 fps) 3 Layers (30 fps) TR (%)
Sequence GOP ABitrate (%) APSNR (dB) ABitrate (%) APSNR (dB) ABitrate (%) APSNR (dB)
Foreman 2 1.31 -0.08 1.55 -0.07 1.88 -0.07 -79.10
4 0.10 0.00 0.08 0.00 0.07 0.00 -81.16
8 1.47 -0.07 1.45 -0.06 1.55 -0.05 -80.87
Hall 2 1.54 -0.06 2.43 -0.10 2.75 -0.10 -75.79
4 1.32 -0.08 1.78 -0.07 2.10 -0.07 -79.84
8 0.12 0.00 0.07 0.00 0.13 0.00 -81.29
CoastGuard 2 1.42 -0.07 1.53 -0.06 1.55 -0.05 -80.95
4 1.50 -0.06 2.06 -0.07 3.39 -0.11 -78.04
8 0.82 -0.04 1.52 -0.06 1.77 -0.05 -80.27
Soccer 2 0.05 0.00 0.07 0.00 0.21 -0.01 -81.44
4 1.63 -0.07 1.52 -0.05 1.55 -0.05 -81.05
8 0.98 -0.04 2.47 -0.09 3.89 -0.11 -78.50
mean 1.02 -0.05 1.38 -0.05 1.74 -0.06 -79.86
higher RD loss, but H.264 encoding is performed faster (up to 38 %). This is because B frames have two reference frames, but dynamic ME search area reduction is carried out in both of them.
For the scalable extension, H.264/SVC, Tables 5, 6 and 7 show the performance. Regarding the Baseline Profile with encoding pattern IPPP, Table 5 shows the RD penalty
measured for QCIF sequences at 15 fps encoded using two temporal layers and the TR of the proposed transcoder.
In Table 5, the results in the two-layers column accumulate the RD of both layers. Generally, the RD results show that the penalty is not significant, being a little higher for high-motion sequences such as Soccer. Comparing the different
Table 7 Performance of the proposed mapping method for H.264/SVC main profile for CIF sequences (30fps) and 3 temporal layers.
Main: IBBBP H.264/SVC pattern
1 Layer (7.5 fps) 2 Layers (15 fps) 3 Layers (30 fps) TR (%)
Sequence GOP ABitrate (%) APSNR (dB) ABitrate (%) APSNR (dB) ABitrate (%) APSNR (dB)
Foreman 2 1.31 -0.08 1.48 -0.07 1.75 -0.06 -79.30
4 1.32 -0.08 1.90 -0.07 2.09 -0.07 -79.99
8 0.82 -0.04 1.42 -0.05 1.78 -0.05 -80.17
Hall 2 0.10 0.00 0.08 0.00 0.08 0.00 -81.26
4 0.12 0.00 0.09 0.00 0.13 0.00 -81.44
8 0.05 0.00 0.17 -0.01 0.15 -0.01 -81.45
CoastGuard 2 1.47 -0.07 1.44 -0.06 1.43 -0.05 -80.98
4 1.42 -0.07 1.56 -0.06 1.46 -0.05 -81.03
8 1.63 -0.07 1.51 -0.05 1.58 -0.05 -81.06
Soccer 2 1.54 -0.06 2.46 -0.10 2.86 -0.10 -75.90
4 1.50 -0.06 2.15 -0.08 3.39 -0.09 -78.13
8 0.98 -0.04 2.44 -0.09 3.91 -0.11 -78.53
mean 1.02 -0.05 1.39 -0.05 1.72 -0.05 -79.94
Table 8 Overall performance of the proposed transcoder for H.264 baseline and main profiles for 15fps QCIF sequences.
Baseline: IPPP H.264 pattern Main: IBBP H.264 pattern
Sequence   GOP   ΔPSNR (dB)   ΔBR (%)   TR (%)   ΔPSNR (dB)   ΔBR (%)   TR (%)
Foreman 2 -0.02 0.64 -75.50 -0.03 0.67 -75.12
4 -0.02 -0.05 -70.66 -0.03 -0.03 -70.51
8 -0.01 -1.02 -76.78 -0.03 -1.01 -76.63
Hall 2 0.00 -2.47 -74.48 0.00 -2.47 -73.99
4 0.00 7.86 -70.01 0.00 7.86 -69.66
8 0.00 8.91 -68.92 0.00 8.92 -68.63
CoastGuard 2 0.00 1.07 -72.41 -0.01 1.10 -72.00
4 0.00 1.35 -74.18 -0.01 1.37 -73.92
8 0.00 1.88 -74.52 -0.02 1.89 -74.31
Soccer 2 -0.03 -2.62 -73.82 -0.04 -2.61 -73.45
4 -0.03 -2.78 -75.61 -0.04 -2.77 -75.39
8 -0.03 -3.15 -73.79 -0.03 -3.15 -73.64
mean -0.01 0.80 -73.39 -0.02 0.81 -73.10
layers, the second layer suffers a higher RD penalty, since it depends on the RD achieved by the base layer. Even so, the RD penalty on average is around 2 % in bitrate increment and 0.07 dB in quality loss. On the other hand, the transcoding time is greatly reduced, reaching 84.81 % of TR on average. Moreover, Table 5 also shows the performance of the Main Profile with an IBBBP pattern for 15 fps QCIF sequences. It exhibits negligible RD distortion for every layer, even for low-complexity sequences such as Hall or CoastGuard. The coding efficiency is better than in the reference transcoder because the stored MVs have an impact on the TR, but it is not very significant for the RD, since the search method is based on the SAE function. The TR achieved is 73.18 % on average, so the time consumed by SVC encoding is greatly reduced by using the MVs generated in the WZ decoding stage. The results are similar for the different WZ GOP sizes, since the proposed MV extraction method is the same for every WZ GOP size.
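The dyadic frame-to-layer assignment behind the 7.5/15/30 fps temporal layers can be illustrated as follows. This is a simplified model of hierarchical temporal scalability, not the JSVM implementation:

```python
def temporal_layer(frame_idx, num_layers):
    """Temporal layer of a frame in a dyadic hierarchical GOP.

    With num_layers=3 (GOP of 4): frames 0, 4, 8, ... -> layer 0;
    frames 2, 6, ... -> layer 1; odd frames -> layer 2. Decoding
    layers 0..k of a 30 fps CIF sequence yields 7.5, 15 or 30 fps.
    """
    gop = 1 << (num_layers - 1)  # dyadic GOP size, e.g. 4 for 3 layers
    step = gop
    layer = 0
    while frame_idx % step != 0:
        step //= 2
        layer += 1
    return layer
```

Dropping the highest layer halves the frame rate, which is what makes each additional layer double the temporal resolution of the decoded sub-stream.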
For the H.264/SVC CIF sequences, the results have been divided into two tables (Tables 6 and 7) due to space limitations. Table 6 shows results for the 30 fps sequences (CIF resolution). For these sequences, the encoding time is reduced by up
Table 9 Overall performance of the proposed transcoder for H.264 baseline and main profiles for 30 fps CIF sequences.

Baseline: IPPP H.264 pattern      Main: IBBP H.264 pattern
Sequence   GOP   ΔPSNR (dB)   ΔBR (%)   TR (%)   ΔPSNR (dB)   ΔBR (%)   TR (%)
Foreman 2 -0.02 -0.16 -68.68 -0.02 -0.01 -68.31
4 -0.02 0.35 -71.02 -0.03 0.45 -70.87
8 -0.01 0.33 -69.63 -0.02 0.40 -69.55
Hall 2 -0.01 4.45 -62.97 -0.01 4.50 -62.52
4 -0.01 7.05 -65.55 -0.01 7.10 -65.33
8 0.00 9.43 -66.43 0.00 9.46 -66.27
CoastGuard 2 -0.01 3.26 -66.89 -0.01 3.36 -66.57
4 -0.01 2.17 -72.61 -0.01 2.24 -72.45
8 -0.01 3.97 -73.19 -0.01 4.05 -73.09
Soccer 2 -0.02 0.56 -69.92 -0.03 0.76 -69.61
4 -0.02 0.49 -73.14 -0.04 0.61 -73.02
8 -0.02 0.85 -73.98 -0.03 0.95 -73.90
mean -0.01 2.73 -69.50 -0.02 2.82 -69.29
Table 10 Overall performance of the proposed transcoder for H.264/SVC baseline and main profiles for 15fps QCIF sequences.
Baseline: IPPP H.264/SVC pattern      Main: IBBBP H.264/SVC pattern
Sequence   GOP   ΔPSNR (dB)   ΔBR (%)   TR (%)   ΔPSNR (dB)   ΔBR (%)   TR (%)
Foreman 2 0.03 1.87 -76.28 0.01 1.34 -74.43
4 0.05 1.29 -71.39 0.00 0.88 -70.19
8 0.05 0.30 -76.70 0.01 0.01 -75.79
Hall 2 0.00 -1.25 -76.66 0.00 -1.25 -74.26
4 0.00 7.67 -72.66 0.00 7.73 -70.48
8 0.00 8.77 -71.52 0.00 8.79 -69.48
CoastGuard 2 0.00 1.87 -74.08 0.00 1.61 -72.24
4 0.01 2.16 -74.96 0.00 1.99 -73.59
8 0.01 2.68 -74.94 0.00 2.49 -73.81
Soccer 2 0.06 -0.96 -74.13 0.03 -1.25 -72.67
4 0.06 -1.08 -75.54 0.03 -1.42 -74.58
8 0.07 -1.44 -73.66 0.03 -1.77 -72.90
mean 0.03 1.82 -74.38 0.01 1.60 -72.87
to 79.86 % on average, whereas slightly better RD results are obtained than in the case of QCIF sequences. Finally, for CIF sequences at 30 fps in the Main profile, the encoding time (Table 7) is reduced by up to 79.94 % with only a negligible RD penalty.
Finally, to analyze the global transcoding improvement, Tables 8, 9, 10 and 11 summarize the overall transcoding performance. In this case, Bjontegaard and Sullivan's common test rule was not used, because it is a recommendation intended only for H.264/AVC/SVC. Instead, to estimate the PSNR obtained by the transcoder, the original sequences were compared with the output sequences after transcoding. The PSNR measured for the four QP points is displayed as an average (ΔPSNR). To estimate the BR generated by the reference and the proposed transcoder, the BR generated by both stages (DVC decoding and H.264 encoding) was added; then Eq. 2 was applied and the result was averaged over the four H.264 QPs (ΔBR). As DVC decoding contributes most of the bitrate, the results are very similar to those in Tables 1 and 2. In order to evaluate the TR, the total transcoding time was measured for the reference and proposed transcoders. Then
Table 11 Overall performance of the proposed transcoder for H.264/SVC baseline and main profiles for 30fps CIF sequences.
Baseline: IPPP H.264/SVC pattern      Main: IBBBP H.264/SVC pattern
Sequence   GOP   ΔPSNR (dB)   ΔBR (%)   TR (%)   ΔPSNR (dB)   ΔBR (%)   TR (%)
Foreman 2 0.01 0.10 -71.03 0.01 0.10 -71.07
4 0.01 0.56 -71.92 0.01 0.56 -71.93
8 0.02 0.45 -70.32 0.02 0.45 -70.31
Hall 2 0.00 4.51 -68.55 0.00 4.50 -68.58
4 0.00 7.09 -68.15 0.00 7.09 -68.17
8 0.00 9.51 -68.33 0.00 9.51 -68.33
CoastGuard 2 0.01 3.42 -70.04 0.01 3.42 -70.07
4 0.00 2.33 -73.48 0.00 2.33 -73.49
8 0.01 4.05 -73.74 0.01 4.05 -73.74
Soccer 2 0.05 0.92 -70.90 0.05 0.92 -70.91
4 0.04 0.74 -73.51 0.04 0.75 -73.51
8 0.03 1.07 -74.21 0.03 1.07 -74.21
mean 0.02 2.90 -71.18 0.02 2.90 -71.19
Figure 13 DVC decoder parallel performance: (a) speed-up factor and (b) overall CPU usage (%), as a function of the number of threads (1 to 8).
Eq. 3 was applied and the mean was calculated over the four H.264 QPs (ΔTR). As DVC decoding takes up most of the transcoding time, improvements in this stage have a bigger influence on the overall transcoding time, and so the TR obtained is similar to that in Tables 1 and 2, reducing the complexity of the transcoding process by around 70 % on average.
5.3 Hardware Usage
This section continues the study of the performance of the proposed hybrid-based algorithm, described in Section 4.1. In particular, the approach has been evaluated by varying the number of threads from 1 to 8 (since the four-core processor used for testing, an Intel i7-940 [32], can execute 8 threads at the same time). Figure 13 shows the performance of the multicore architecture executing the proposed parallel WZ decoder. Results were measured for the Foreman sequence at 15 fps with a GOP size of 2, since the behavior does not depend on the particular sequence. Figure 13a shows the speed-up factor achieved. The speed-up keeps increasing up to 6 threads, and the maximum speed-up factor is around 3.75 on a 4-core processor, which represents a performance close to the theoretical maximum of 4. Finally, Fig. 13b shows the overall CPU usage as a percentage for a varying number of threads; the values measured by the performance monitor are summarized as an average. CPU usage increases almost linearly, reaching a maximum of around 75 %, due to the synchronization barriers and the idle status at the end of the decoding process. As can be observed, the first 4 threads increase the speed-up and CPU usage, whereas the following 4 threads achieve a smaller increment. This happens because each of the first 4 threads executes alone on a different core. When more threads are used, threads begin sharing cores, so the time reduction is lower (hyper-threading allows up to two threads per core). The theoretical maximum is not reached mainly because of the synchronization barriers; in other words, there are threads waiting at a synchronization barrier to reconstruct a frame. Further time is spent on critical sections for shared structures and on module initialization. Moreover, some modules of the DVC decoder, such as SI generation, are not parallelized, although they do not account for much time (see Fig. 3).
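The gap between the measured 3.75x speed-up and the theoretical 4 can be framed with Amdahl's law. The sketch below is illustrative only; the 2 % serial fraction is a hypothetical value chosen to show the shape of the curve, not a measured figure:

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Amdahl's law: attainable speed-up when a fraction of the work
    (barriers, SI generation, initialization) stays sequential."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)
```

A serial fraction of only 2 % already caps a 4-core run at about 3.77x, close to the 3.75x observed, which is consistent with the small non-parallelized portion of the decoder.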
In order to analyze CPU usage in more detail, it was measured every second as a percentage of the overall CPU capacity. As
Figure 14 Analysis of the CPU usage timeline (in seconds).
Table 12 Comparison between parallel proposal and [15].
            TR of [15] (%)
Sequence    QP 1     QP 4     QP 6     QP 8      TR Proposal (%)
Foreman -20.00 -46.67 -52.00 -60.71 -74.19
CoastGuard -33.33 -80.00 -64.71 -70.21 -71.23
Fig. 14 shows for the hybrid-based WZ decoding proposal, there is an initialization period when the decoding process starts. Subsequently, the synchronization barriers for each frame make the CPU usage oscillate between 45 and 100 %. Towards the end, CPU usage decreases steadily until the decoding of the last GOP is finished. Average CPU usage for the whole decoding process is 80.24 %, whereas the (sequential) reference decoder exhibits a constant CPU usage of 14 %. In other words, the present approach makes better use of the CPU resources.
5.4 Comparison with Other Proposal
This section compares the proposed parallel algorithms with those described in Section 3. RD results are not taken into account because they are negligible or are not reported in the corresponding papers or related work by their authors. Table 12 compares the results obtained by Kubasov et al. [15] and our parallel hybrid-based proposal. In [15] the results are given for the Foreman and CoastGuard QCIF sequences at 15 fps with GOP=2. The approach presented in [15] achieves a greater time reduction when more BPs are encoded but, on average, our parallel approach outperforms it in terms of TR.
Morbee et al. propose in [16] a rate allocation algorithm for DVC architectures, implemented using H.263 to encode the K frames. In their results, the QCIF sequences are encoded with 1 to 3 BPs at 15 fps and a fixed H.263 QP of 20 (Table 13). Compared with the parallel hybrid-based proposal, the TR achieved by [16] is much lower. It should also be said that a fixed QP is not realistic. The parallel hybrid-based proposal changes the QP depending on
Table 13 Comparison between parallel proposal and [16].
Sequence TR (%) QP 1 QP 2 QP 3 TR Proposal (%)
Foreman -48.43 -55.76 -64.49 -74.19
Hall -44.37 -41.59 -34.34 -71.90
CoastGuard -62.90 -52.84 -54.34 -71.23
Table 14 Comparison between parallel proposal and [17]. Solution 1.
Sequence TR Mean (%) TR Proposal (%)
Foreman -51.75 -74.19
Hall -20.00 -71.90
CoastGuard -37.87 -71.23
Soccer -60.37 -74.62
the number of BPs used and the particular sequence in question, in order to ensure a constant quality between WZ and K frames. Hence, the TR achieved by the parallel proposal is considerably better than that reported in [16].
Additionally, the proposed algorithm is compared with the rate control scheme proposed by Areia et al. in [17]. In [17] a proposal is made for a hybrid rate control with a constant (Table 14) and a selective (Table 15) final number of rate chunks. In both cases, the parallel hybrid-based proposal presented in Section 4.1 achieves a greater time reduction.
Finally, the proposed parallel WZ decoding algorithm is compared with the algorithm proposed by Momcilovic et al. in [14] for LDPC decoding. In particular, in [14] three decoding algorithms are considered: Sum-Product BP, Min-Sum BP and Algorithm E; and the LUT algorithm as a variation of the Sum-Product BP algorithm.
Fig. 15 compares the speed-up factor achieved by the four LDPC decoding variants of [14] (executed on a four-core processor) with that of the parallel WZ decoding algorithm. In addition, in order to evaluate the performance of the proposed transcoding approach against other proposals available in the literature, which were introduced in Section 3.2, several sequences were encoded using different patterns.
Table 16 shows a comparison between the proposed transcoder and two different WZ transcoding proposals: Peixoto et al. [21] (WZ to H.263 transcoder) and Martínez et al. [22] (WZ to H.264/AVC transcoder). The simulations presented were carried out keeping the same test conditions as in the other authors' proposals, with the aim of offering a fair comparison. On the one hand, comparing with Peixoto et al.
Table 15 Comparison between parallel proposal and [17]. Solution 2.
Sequence TR Mean (%) TR Proposal (%)
Foreman -61.37 -74.19
Hall -25.87 -71.90
CoastGuard -46.50 -71.23
Soccer -69.37 -74.62
Figure 15 Comparison of speed-up between the hybrid-based proposal and the Sum-Product, LUT, Min-Sum and Algorithm E LDPC decoding variants of [14].
[21], the proposed transcoder achieves a better performance for patterns with P frames, with a negligible RD loss when B frames are introduced. This is due to the fact that H.263 uses MB partitions that are more similar to WZ than those of H.264/AVC, whose MB partitions and sub-partitions can be smaller, such as 4x4. However, the acceleration of the two transcoders, to H.264/AVC and to H.263, cannot be compared, since no time results are reported in [21] (indicated in Table 16 as n.a.). This is the major shortcoming of the comparison with the H.263-based transcoder [21], as some rate-distortion penalty is acceptable only if the corresponding time reduction is above acceptable values. In addition, H.264/AVC generates lower bitrates than
Table 16 Performance of proposed WZ-to-H.264/AVC transcoder compared with other state-of-the-art proposals.
RD performance comparison
Proposal Sequence Pattern fps ABitrate (%) APSNR (dB) TR (%)
Proposed Transcoder Foreman IPIP 15 0.19 -0.009 -62.57
Peixoto et al. [21] Foreman IPIP 15 1.08 -0.063 n.a.
Martinez et al. [22] Foreman IPIP 15 0.77 -0.034 -71.61
Proposed Transcoder Soccer IPIP 15 1.23 -0.046 -59.30
Peixoto et al. [21] Soccer IPIP 15 1.38 -0.102 n.a.
Martinez et al. [22] Soccer IPIP 15 2.02 -0.076 -47.98
Proposed Transcoder Foreman IPPPI 30 0.21 -0.008 -63.79
Peixoto et al. [21] Foreman IPPPI 30 2.83 -0.067 n.a.
Martinez et al. [22] Foreman IPPPI 30 0.93 -0.039 -73.99
Proposed Transcoder Foreman IBP 15 3.83 -0.142 -61.26
Peixoto et al. [21] Foreman IBP 15 1.98 -0.102 n.a.
Martinez et al. [22] Foreman IBP 15 9.87 -0.394 -72.42
Proposed Transcoder Soccer IBP 15 4.10 -0.164 -69.22
Peixoto et al. [21] Soccer IBP 15 2.37 -0.149 n.a.
Martinez et al. [22] Soccer IBP 15 13.09 -0.493 -79.07
Table 17 Performance of proposed WZ-to-H.264/AVC transcoder compared with other state-of-the-art proposals.
RD performance comparison
Proposal Resolution GOP size APSNR (dB) TR (%)
Proposed CIF 2 -0.07 -79.10
Dziri et al. [27] CIF 2 -0.50 -47.00
Al-Muscati et al. [28] CIF 2 -0.50 -37.00
Garrido-Cantos et al. [29] CIF 2 -0.01 -41.79
Proposed QCIF 8 -0.04 -78.00
Dziri et al. [27] QCIF 8 n.a. n.a.
Al-Muscati et al. [28] QCIF 8 -0.20 -55.20
Garrido-Cantos et al. [29] QCIF 8 -0.03 -51.90
H.263, so the same small absolute bitrate difference in H.264/AVC has a bigger influence on ΔBitrate in percentage terms.
Furthermore, this proposal greatly improves upon the RD results achieved by Martínez et al. [22], for each H.264/AVC pattern. The main part of this RD improvement is achieved by the addition of the minimum radial length for the reduced search area and the dynamic scaling according to the distance of the reference frame.
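The idea of a minimum radial length plus dynamic scaling with the reference distance can be sketched as follows. All names and constants below are hypothetical illustrations, not values taken from the implementation:

```python
def reduced_search_range(mv, ref_distance, min_radius=4, base_range=16):
    """Illustrative reduced ME search range around an incoming WZ motion vector.

    The radius grows with the temporal distance of the reference frame
    (scaling the largest MV component), is clamped above by the full
    search range and below by a minimum radial length.
    """
    mvx, mvy = mv
    radius = max(min_radius,
                 min(base_range, ref_distance * max(abs(mvx), abs(mvy))))
    return radius
```

A near-zero MV still yields a small non-degenerate search window (the minimum radius), while distant references expand the window up to the full range, which is the intuition behind both improvements.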
To evaluate the performance of the proposed WZ to H.264/SVC transcoder against other approaches based on H.264/SVC transcoding with temporal scalability for the Baseline Profile, a study is carried out here with similar proposals available in the literature. Since this paper proposes the first WZ to H.264/SVC transcoding approach, the present proposal is compared with transcoding approaches based on H.264/AVC to H.264/SVC transcoding. In particular, in Table 17, the proposed transcoder is compared with Dziri et al. [27], Al-Muscati et al. [28] and Garrido-Cantos et al. [29].
For the performance comparison in Table 17, the Foreman sequence was chosen (in QCIF with GOP 8 and in CIF with GOP 2), as it is the only sequence common to all the works; selecting the same sequences allows a fair comparison. In addition, no numerical information related to ΔBitrate is given in [27, 28]. Although the RD penalty is not significant in any of the proposals, our transcoding method achieves up to 79.10 % of TR, whereas the other proposals do not reduce the transcoding time by more than 55.2 %.
6 Conclusions
This paper presents a DVC to H.264 transcoder designed to support mobile-to-mobile video communications; moreover, the proposed transcoder is extended to H.264/SVC. Since the transcoder device accumulates the highest complexity of both video paradigms, reducing the time spent in this process is an important goal. With this aim, two approaches are proposed in this paper to improve DVC decoding and H.264/AVC/SVC encoding. Together they achieve a time reduction of up to 70 % for the complete transcoding process with negligible RD losses. In addition, the presented transcoder performs a mapping for different GOP patterns and lengths between the two paradigms by using an adaptive algorithm, which takes into account the MVs gathered in the side information generation process. In a nutshell, this work offers a straightforward step in the framework of fast and flexible DVC to H.264/AVC/SVC video transcoders with temporal scalability. Ongoing work aims to explore other parallel schemes and to try a DVC architecture without a feedback channel, in order to test a more realistic DVC framework.
Acknowledgments This work has been jointly supported by the MINECO and the European Commission (FEDER funds) under the project TIN2012-38341-C04-04. The work presented was carried out by using the VISNET2-WZ-IST software developed in the framework of the VISNET II project.
References
1. ISO/IEC International Standard 14496-10. (2003). Information Technology—Coding of Audio—Visual Objects—Part 10: Advanced Video Coding.
2. ITU-T and ISO/IEC JTC 1. (2009 March). Advanced Video Coding for Generic Audiovisual Services. ITU-T Rec. H.264/AVC and ISO/IEC 14496-10 (including SVC extension).
3. Schwarz, H., Marpe, D., & Wiegand, T. (2007). Overview of the scalable video coding extension of the H.264/AVC standard. IEEE Transactions on Circuits and Systems for Video Technology, 17(9), 1103-1120.
4. Aaron, A., Rui, Z., & Girod, B. (2002 Nov). Wyner-Ziv coding of motion video. In Asilomar Conference on Signals, Systems and Computers (pp. 240-244). Pacific Grove, USA. doi:10.1109/ ACSSC.2002.1197184.
5. Brites, C., Ascenso, J., Quintas Pedro, J., & Pereira F. (2008). Evaluating a feedback channel based transform domain Wyner-Ziv video codec. Signal Processing: Image Communication, 23(4), 269297. doi:10.1016/j.image.2008.03.002.
6. DVEO Professional Broadcast Quality Subsystems. http://www. dveo.com/. Accessed 7 Aug 2013.
7. Girod, B., Aaron, A. M., Rane, S., & Rebollo-Monedero, D. (2005). Distributed video coding. Proceedings of the IEEE, 93(1), 71-83.
8. Badem, M., Fernando, W. A. C., & Kondoz, A. (2010). Transform domain distributed video coding with spatial correlations. Multimedia Tools and Applications, 48(3), 369-379.
9. Ascenso, J., Brites, C., & Pereira, F. (2010). A flexible side information generation framework for distributed video coding. Multimedia Tools and Applications, 48(3), 381-409.
10. Artigas, X., Ascenso, J., Dalai, M., Klomp, S., Kubasov, D., & Ouaret, M. (2007). The DISCOVER codec: Architecture, techniques and evaluation. In Picture Coding Symposium (PCS) (pp. 1-4). Lisbon, Portugal: Citeseer.
11. Ascenso, J., Brites, C., Dufaux, F., Fernando, A., Ebrahimi, T., Pereira, F., et al. (2010 August). The VISNET II DVC Codec: Architecture, Tools and Performance. In European Signal Processing Conference (EUSIPCO), Aalborg, Denmark.
12. ISO/IEC 14496-2 PDAM1. (1999). Information Technology—Generic Coding of Audio-Visual Objects—Part 2: Visual.
13. Ryanggeun, O., Jongbin, P., & Byeungwoo, J. (2010 March). Fast implementation of Wyner-Ziv Video codec using GPGPU. In IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB) (pp. 1-5). Shanghai, China. doi:10.1109/ ISBMSB.2010.5463150.
14. Momcilovic, S., Yige, W., Rane, S., & Vetro, A. (2010 October). Toward realtime side information decoding on multi-core processors. In IEEE International Workshop on Multimedia Signal Processing (MMSP) (pp. 321-326). Saint-Malo, France. doi:10. 1109/MMSP.2010.5662040.
15. Kubasov, D., Lajnef, K., & Guillemot, C. A. (2007 October). Hybrid Encoder/Decoder Rate Control for Wyner-Ziv Video Coding with a Feedback Channel. In IEEE 9th Workshop on Multimedia Signal Processing (MMSP) (pp. 251-254). Chania, Crete, Greece. doi:10.1109/MMSP.2007.4412865.
16. Morbee, M., Roca, A., Prades-Nebot, J., Pizurica, A., & Philips, W. (2008). Reduced decoder complexity and latency in pixel-domain Wyner-Ziv video coders. Signal, Image and Video Processing, 2(2), 129-140.
17. Areia, J., Ascenso, J., Brites, C., & Pereira, F. (2008 August). Low complexity hybrid rate control for lower complexity Wyner-Ziv video decoding. In 16th European Signal Processing Conference (EUSIPCO), Lausanne, Switzerland.
18. Fernández-Escribano, G., Cuenca, P., Orozco-Barbosa, L., Garrido, A., & Kalva, H. (2008). Simple intra prediction algorithms for heterogeneous MPEG-2/H.264 video transcoders. Multimedia Tools and Applications, 38(1), 1-25.
19. Fernández-Escribano, G., Kalva, H., Cuenca, P., & Orozco-Barbosa, L. (2007). A first approach to speeding-up the inter mode selection in MPEG-2/H.264 transcoders using machine learning. Multimedia Tools and Applications, 35, 225-240.
20. De Cock, J., Notebaert, S., Vermeirsch, K., Lambert, P., & Van de Walle, R. (2010). Dyadic spatial resolution reduction transcoding for H.264/AVC. Multimedia Systems, 16(2), 139-149.
21. Peixoto, E., Queiroz, R. L., & Mukherjee, D. (2010). A Wyner-Ziv video transcoder. IEEE Transactions on Circuits and Systems for Video Technology, 20(2), 189-200.
22. Martínez, J. L., Fernández-Escribano, G., Kalva, H., Fernando, W. A. C., & Cuenca, P. (2009). Wyner-Ziv to H.264 video transcoder for low cost video encoding. IEEE Transactions on Consumer Electronics, 55(3), 1453-1461.
23. Corrales-García, A., Martínez, J. L., & Fernandez-Escribano, G. (2010 October). Reducing DVC Decoder Complexity in a Multicore System, In IEEE International Workshop on Multimedia Signal Processing (MMSP) (pp. 315-320). Saint-Malo, France. doi:10.1109/MMSP.2010.5662039.
24. Corrales-García, A., Martínez, J. L., Fernandez-Escribano, G., Quiles, F. J., & Fernando, W. A. C. (2011 October). Wyner-Ziv Frame Parallel Decoding Based on Multicore Processors. In IEEE International Workshop on Multimedia Signal Processing (MMSP) (pp. 1-6). Hangzhou, China. doi:10.1109/MMSP.2011.6093835.
25. Corrales-García, A., Martínez, J. L., Fernandez-Escribano, G., & Quiles, F. J. (2012 January). Forward Wyner-Ziv Fast Video Decoding Using Multicore Processors, In International Conference on MultiMedia Modeling (MMM), vol. 7131 (pp. 574-584). Klagenfurt, Austria. doi:10.1007/978-3-642-27355-1_53.
26. Sheng, T., Zhu, X., Hua, G., Guo, H., Zhou, J., & Chen, C. (2010). Feedback-free rate-allocation scheme for transform domain Wyner-Ziv video coding. Multimedia Systems, 16(2), 127-137. doi:10.1007/s00530-009-0179-8.
27. Dziri, A., Diallo, A., Kieffer, M., & Duhamel, P. (2008 August). P-Picture Based H.264/AVC to H.264/SVC Temporal Transcoding, In International Wireless Communications and Mobile Computing Conference (IWCMC), (pp. 425-430). Crete Island, Greece.
28. Al-muscati, H., & Labeau, F. (2010 July). Temporal Transcoding of H.264/AVC Video to the Scalable Format. In 2nd International Conference on Image Processing, Theory Tools and Applications (IPTA), (pp. 138-143). Paris, France. doi:10.1109/IPTA.2010.5586733.
29. Garrido-Cantos, R., de Cock, J., Martínez, J. L., Van Leuven, S., Cuenca, P., Garrido, A., & Van de Walle, R. (2010 October). Video Adaptation for Mobile Digital Television. In 4th IFIP Wireless and Mobile Networking Conference (WMNC), (pp. 1-6). Budapest, Hungary. doi:10.1109/WMNC.2010.5678746.
30. Ascenso, J., Brites, C., & Pereira, F. (2005 June). Improving frame interpolation with spatial motion smoothing for pixel domain distributed video coding. In Speech and Image Processing, Multimedia Communications and Services (EURASIP). Smolenice, Slovak Republic.
31. The OpenMP API specification for parallel programming. http://openmp.org. Accessed 7 Aug 2013.
32. Intel Processor Core family. http://www.intel.com/. Accessed 7 Aug 2013.
33. JM H.264/AVC Reference Software, Version 17.2. http://iphome.hhi.de. Accessed 7 Aug 2013.
34. Joint Video Team JSVM reference software, Version 9.19.3. http://www.hhi.fraunhofer.de/de/kompetenzfelder/image-processing/research-groups/image-video-coding/svc-extension-of-h264avc/jsvm-reference-software.html. Accessed 7 Aug 2013.
35. Sullivan, G., & Bjontegaard, G. (2001). Recommended Simulation Common Conditions for H.26L Coding Efficiency Experiments on Low-Resolution Progressive-Scan Source Material. In ITU-T.
Alberto Corrales-Garcia received his B.Sc. degree in Computer Engineering in 2008, his M.Sc. degree in Computer Science in 2009, and his Ph.D. degree in 2012, all from the University of Castilla-La Mancha, Albacete, Spain. In 2009 he joined the Albacete Research Institute of Informatics (I3A), University of Castilla-La Mancha. His research interests include video coding, multimedia communications, video transcoding, and parallel data processing. He has also been a visiting researcher at the University of Surrey, Guildford (UK), in the Centre for Vision, Speech and Signal Processing (CVSSP).
Rafael Rodríguez-Sánchez
obtained his M.S. and Ph.D. degrees in Computer Science from the University of Castilla-La Mancha, Spain, in 2010 and 2013, respectively. From 2008 to 2013 he worked at the Albacete Research Institute of Informatics, University of Castilla-La Mancha, Spain. In 2013, he joined the Department of Engineering and Computer Sciences at Jaime I University, Castellón, Spain. His research interests include video coding, parallel programming, heterogeneous computing, and power and energy consumption.
Jose Luis Martinez
received his M.S. and Ph.D. degrees in Computer Science and Engineering from the University of Castilla-La Mancha, Albacete, Spain, in 2007 and 2009, respectively. In 2005, he joined the Department of Computer Engineering at the University of Castilla-La Mancha, where he was a researcher in the Computer Architecture and Technology group at the Albacete Research Institute of Informatics (I3A). In 2010, he joined the Department of Computer Architecture of Complutense University in Madrid, where he was an assistant professor. In 2011 he returned to the University of Castilla-La Mancha and joined the Department of Computer Engineering, where he is an assistant professor. His research interests include Distributed Video Coding (DVC), multimedia standards, video transcoding, and parallel video processing. He has also been a visiting researcher at Florida Atlantic University, Boca Raton (USA), and at the Centre for Communication System Research (CCSR), University of Surrey, Guildford (UK). He has over 40 publications in these areas in international refereed journals and conference proceedings. He is a member of the IEEE.
Gerardo Fernandez-Escribano
received the M.Sc. degree in Computer Engineering and the Ph.D. degree from the University of Castilla-La Mancha, Albacete, Spain, in 2003 and 2007, respectively. In 2004, he joined the Department of Computer Engineering at the University of Castilla-La Mancha, where he is currently an Associate Professor at the School of Industrial Engineering, teaching Industrial Informatics. Since September 2010, he has been the Master's Thesis and Professor Coordinator of the Master's Degree in Education for High School Professors at the University of Castilla-La Mancha. As a researcher, he is a member of the Computer Architecture and Technology Area. His research interests include multimedia standards (MPEG, H.264/AVC, HEVC and DVC), video transcoding, video compression, video transmission, and machine learning mechanisms. He has published over 50 papers in international journals and conferences and participated in 12 research projects. He also holds a USA patent jointly with Dr. Hari Kalva. He has been a visiting researcher at Florida Atlantic University, Boca Raton, six times (pre- and post-doctoral research stays totaling more than eighteen months), and at the Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany.
Francisco Jose Quiles
received his degree in Physics (Electronics and Computer Science) and his Ph.D. degree from the University of Valencia, Spain, in 1986 and 1993, respectively. In 1986 he joined the Computer Systems Department at the University of Castilla-La Mancha, where he is currently a Full Professor of Computer Architecture and Technology. His research interests include high-performance networks, parallel algorithms for video compression and video transmission, Distributed Video Coding (DVC), and 3D video. He has developed several courses on computer organization and computer architecture. He has published over 200 papers in international journals and conferences and participated in 68 research projects. He has also supervised more than nine doctoral theses. Currently he is chair of the Computing Systems Department.