Scholarly article on topic 'Speeding up a Video Summarization Approach Using GPUs and Multicore CPUs'

Speeding up a Video Summarization Approach Using GPUs and Multicore CPUs Academic research paper on "Computer and information sciences"

Share paper
Academic journal
Procedia Computer Science
OECD Field of science
{"Video Summarization" / GPUs / "Multicore CPUs"}

Abstract of research paper on Computer and information sciences, author of scientific article — Suellen S. de Almeida, Antônio Carlos de Nazaré Júnior, Arnaldo de Albuquerque Arauújo, Guillermo Cámara-Chávez, David Menotti

Abstract The recent progress of digital media has stimulated the creation, storage and distribution of data, such as digital videos, generating a large volume of data and requiring efficient technolo- gies to increase the usability of these data. Video summarization methods generate concise summaries of video contents and enable faster browsing, indexing and accessing of large video collections, however, these methods often perform slow with large duration and high quality video data. One way to reduce this long time of execution is to develop a parallel algorithm, using the advantages of the recent computer architectures that allow high parallelism. This paper introduces parallelizations of a summarization method called VSUMM, targetting either Graphic Processor Units (GPUs) or multicore Central Processor Units (CPUs), and ultimately a sensible distribution of computation steps onto both hardware to maximise performance, called “hybrid”. We performed experiments using 180 videos varying frame resolution (320 × 240, 640 × 360, and 1920 × 1080) and video length (1, 3, 5, 10, 20, and 30minutes). From the results, we observed that the hybrid version reached the best results in terms of execution time, achieving 7× speed up in average.

Academic research paper on topic "Speeding up a Video Summarization Approach Using GPUs and Multicore CPUs"

Speeding up a Video Summarization Approach using GPUs

and Multicore CPUs

Suellen S. de Almeida1, Antonio Carlos de Nazaré Júnior1, Arnaldo de Albuquerque Araújo1, Guillermo Camara-Chávez2, and David Menotti2

1 Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil {suellen.almeida, antonio.nazare arnaldo} 2 Federal University of Ouro Preto, Ouro Preto, Minas Gerais, Brazil {guillermo, menotti}


The recent progress of digital media has stimulated the creation, storage and distribution of data, such as digital videos, generating a large volume of data and requiring efficient technologies to increase the usability of these data. Video summarization methods generate concise summaries of video contents and enable faster browsing, indexing and accessing of large video collections, however, these methods often perform slow with large duration and high quality video data. One way to reduce this long time of execution is to develop a parallel algorithm, using the advantages of the recent computer architectures that allow high parallelism. This paper introduces parallelizations of a summarization method called VSUMM, targetting either Graphic Processor Units (GPUs) or multicore Central Processor Units (CPUs), and ultimately a sensible distribution of computation steps onto both hardware to maximise performance, called "hybrid". We performed experiments using 180 videos varying frame resolution (320 x 240, 640 x 360, and 1920 x 1080) and video length (1, 3, 5, 10, 20, and 30 minutes). From the results, we observed that the hybrid version reached the best results in terms of execution time, achieving 7x speed up in average.

Keywords: Video Summarization, GPUs, Multicore CPUs

1 Introduction

The amount and the quality of multimedia information such as audios, images, and videos are growing every day. All this information creates a large collection of data and leads to the requirement of efficient and effective management of such quantity of information. More specifically, videos are a multimedia resource that is in continuous generation. A huge quantity of video is uploaded to the Internet, TV video information is stored and security cameras generate hours of videos every day [3]. In order to increase the usability of such large volume of videos, new approaches have been studied. In this context, video summarization is a largely

Selection and peer-review under responsibility of the Scientific Programme Committee of ICCS 2014 159

© The Authors. Published by Elsevier B.V.

researched subject [1, 6, 11] because of its importance as a preprocessing step in many video applications, such as indexing, browsing and retrieval [9].

Video summarization is the process of extracting a summary of the original video content, the goal is to provide concise information of the content, preserving the original message of the video [21]. New methods for video summarization have been proposed in the literature, however, often they are time consuming [12]. A simple, effective and efficient approach for automatic video summarization is proposed by [9]. This method, called VSUMM, is based on the extraction of color characteristics of the video frames and on the unsupervised learning of features. Although this approach generates summary in a feasible time for short videos with low resolution, it becomes impractical for high-resolution videos with varying duration.

In this same sense, with advances in representations of digital videos and pictures, and the need of efficient management methods for handling these data, also came the necessity of hardware that can run these methods quickly and efficiently for such large data collections. Central Processor Units (CPUs) with multiple cores are considered suitable for such processing. Simultaneously, the manycore processors with hundreds of processing elements of the Graphics Processing Units (GPUs) as computing engines have been widely adopted in recent years. On one hand, the graphic processor chips started as video processors with fixed function, but have become increasingly programmable and powerful in terms of computing, being used for general purpose computation. In 2006, the NVIDIA company has introduced CUDATM, a parallel computing architecture that leverages the parallel computing of NVIDIA GPUs to solve many complex computational problems through a much more efficient manner than in CPU [18]. On the other hand, the CPUs are much better in performing a much larger array of tasks than GPUs, are designed to lower latency in a single task as much as possible, and can handle the most complex programs with ease, but still lag behind GPUs when large amounts of simple calculations can be done in parallel [20].

Several approaches for computer vision and image/video processing on GPUs and multicore CPUs have been proposed in the literature and some related works are described below. In the field of video processing, in [5], it is discussed video encoding and decoding methods on GPUs and presents how some video coding modules can be implemented in certain ways to expose as much data parallelism as possible. In [14], it is presented an interactive retexturing approach for images and videos which is made easier via real-time bilateral grid on GPU. The experiments show the high-quality realistic retexturing of interactive images/videos with real-time feedback using their approach with latest GPU parallelism. In [13], it is described an approach for parallelization of JPEG compression on GPU architectures. By analyzing the performance of implemented approach for real-time or interactive video applications, it is demonstrated in experiments that the approach works well for 2160p/4k video resolution (2,160 lines of horizontal resolution, where p stands for progressive scan or not-interlaced) and with small impact on overall latency.

In computer vision tasks, three common feature extraction algorithms (FAST, HoG and SIFT) are analyzed in [7]. It is introduced an efficient heterogeneous multicore CPUs architecture and showed that this architecture presented the best results comparing to GPUs. In [25], SIFT feature extraction is explored on the GPUs and the KLT feature tracking algoritm [24] is also implemented on GPUs. In both cases, the authors divide computation between the CPU and GPUs and conclude that the new generation of graphics cards can perform high quality feature tracking, interest point detection and matching on high resolution videos. For 3D computer vision tasks, Bundle Adjustment algorithms exploiting hardware parallelism for efficiently solving large scale 3D scene reconstruction problems with multicore CPUs as well as multicore GPUs is proposed in [26]. The results show that GPUs lead to more space efficient algorithms

Video input

Video Segmentation

Color Feature Extraction Frames Clustering

(Color Histogram, (K-Means, Euclidean Distance) HSV, 16 bins)

Static Video Summary Composition

Keyframe Extraction Elimination of Similar Keyframes

Figure 1: VSUMM approach.

and surprisingly diminishes the runtime.

Taking this in consideration, the problem of generating summaries of long videos with high resolution using a shorter time can be solved by using Multicore CPUs and GPUs. By carefully analyzing the video summarization method called VSUMM proposed in [9], we introduce parallelization in the method and consequently the acceleration of VSUMM by using GPUs, multicore CPUs, and an hybrid approach combining these architectures. Experiments show that speed up about 20 x can be achieved. The main contributions of this work are the fol-lowings: 1) The proposition of three parallel algorithms for an automatic video summarization approach. To the best of our knowledge, in the literature, it has not been found any paralleliza-tion of video summarization method. 2) The idea of building a three-dimensional CUDA grid of threads to make simultaneous computations in all video frames that fit in memory. Usually, in the literature, the algorithms do the computations on GPUs for each frame separately. 3) The evidence that hybrid multicore CPUs & GPUs algorithms can gather the best of both architectures, significantly improving the runtime. 4) For experiments, the creation of a database with YouTube videos from different genres with three resolutions (1920 x 1080, 640 x 360, and 320 x 240 pixels) and six video lengths (1, 3, 5,10, 20, and 30 minutes) - ten for each combination, adding up 180 videos, which are publicly available along with the source code used [2].

The remainder of this paper is organized as follows: Section 2 presents the VSUMM method used as basis for our study. Section 3 describes the three different parallel implementations of the VSUMM. Section 4 provides the experiments and results while Section 5 outlines conclusions.

2 Automatic Video Summarization Approach

Video summarization is the process of extracting a summary of the original video content, whose goal is to quickly provide concise information from video content, preserving the original message of the video [21]. The approach for automatic summarization used as the basis for parallelization in this work is the one proposed in [9] called VSUMM. This approach is divided in 5 steps. They are briefly presented in the following and a flowchart illustrating these steps

is shown in Figure 1.

2.1 Video segmentation

There are several types of temporal video segmentation methods. In this approach, the frame segmentation adopted consists of processing separately each frame. Moreover, VSUMM does not consider all the video frames, using only a sampled subset at a rate of one frame per second.

2.2 Color feature extraction and Elimination of meaningless frames

In VSUMM, the color histogram is used to describe the visual content of video frames, representing the distribution of color value frequency in an image. In this approach, the frames are transformed from the RGB to the HSV color space, which is a popular choice for manipulating colors [9]. The color histogram is computed only from the Hue component, which represents the dominant spectral color component in its pure form [9]. Finally the quantization of the color histogram is set from 256 to 16 color bins, aiming to significantly reduce the amount of processed data without losing important information [9].

However, before computing the short hue histogram, the monochromatic frames originated due to fade-in/out effects are eliminated since they are considered meaningless. That is, those frames whose standard deviation of the RGB color feature vectors are sufficiently small or close to zero are removed [9]. Note that this step is executed as a pre-processing.

2.3 Frame clustering

The clustering algorithm chosen is fc-means [16], one of the simplest unsupervised learning algorithms and widely used, presenting promising results for the considered problem. The main idea of fc-means is to simultaneously maximize the between-clusters distance and to minimize the intra-clusters distance. Initially, the algorithm distributes the video frames among the fc clusters, in which fc is determined a priori [9]. Then, it computes the average of the elements of each group, corresponding to the initial/new centroids. The next step is to determine for each element the group corresponding to the closest centroid. With this new rearrangement of elements, new centroids are calculated. So, the process is repeated until it converges, i.e., no change occurs or the centroids stabilize.

2.4 Keyframe extraction

Once the clusters are computed by fc-means, the next step is to identify the keyframes. A cluster is considered a keycluster if its size is larger than half the average cluster size. Similarly, for each keycluster, the closest frame to the keycluster centroid is selected as a keyframe.

2.5 Elimination of similar keyframes

To eliminate similar keyframes, all keyframes are compared among themselves using color histogram. If the Euclidean distance between keyframes is lower than a threshold t, then the keyframe is removed from the summary. Finally, the remaining keyframes are arranged in temporal order so that the produced summary is more understandable.

3 Parallel Versions for VSUMM

In this section, three parallel versions of VSUMM approach are presented. We start with a GPUs version, then a multicore CPUs implementation, and finally a hybrid version gathering the best features of multicore CPUs and GPUs architectures.

3.1 Parallel GPUs Version

The GPUs are especially well-suited to address problems that can be expressed as data-parallel computations - the same program is executed on many data elements in parallel - with high arithmetic intensity - the ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control, and also as it is executed on many data elements and has high arithmetic intensity, the memory access latency can be reduced with calculations instead of big data caches [8].

In recent years, a lot of research about how to use GPUs for general purpose computing have been explored. Before that, GPU implementations could only be achieved using existing 3D-rendering APIs such as OpenGL and DirectX. Once the value of GPUs for general-purpose computing has been realized, GPU vendors added driver and hardware support to use the highly parallel hardware of the GPU without the need for computation to proceed through the entire graphics pipeline and without the need to use 3D APIs at all [4]. NVIDIA's solution is the CUDA language, an extension to C with keywords that designate data-parallel functions, called kernels, and their associated data structures to the compute devices. In NVIDIA CUD A programming model, the system consists of a host that is a traditional CPU and one or more compute devices that are massively data-parallel coprocessors. Each CUDA device processor supports the Single-Program Multiple-Data (SPMD) model, where all concurrent threads are based on the same code, although they may not follow exactly the same path of execution [22].

The first parallel version for VSUMM is implemented in CUDA, taking advantage of the manycore capabilities of GPUs and the implementation details of the three first and main steps are explained below. According to our analysis, the other steps are inexpensive in the sequential version such that they were not parallelized.

Video segmentation

NVCUVID [17] is a NVIDIA library to decode videos in CUDA. Nonetheless, intensive tests performed with this library have shown that the run time to decode videos, in most cases, was greater than decoding using CPU. Therefore, the CPU sequential decoding is chosen.

Feature extraction and elimination of meaningless frames

The feature extraction consists in transforming frames from the RGB color space to the HSV and then calculating the Hue color histogram. The main idea of the parallelization in this step is to run operations for the largest possible amount of frames at the same time in GPUs. In order to do that, the CUDA grid of threads is considered as a three-dimensional grid. The first two dimensions are used for the width and height of the frames and the third dimension is used for the sequential number of the frame in video that the GPUs are able to simultaneously process.

Towards transforming the frames from the RGB to the HSV color space, each thread of the grid computes the new color value of each RGB pixel for all pixels from all frames in the GPUs.

For example, if the frame size is 1920 x 1080 pixels and we put one hundred frames in GPUs, we need 207,360,000 threads to perform this operation.

In order to compute the color histogram, the same idea is used: each thread accesses one pixel of the frame and increments the correspondent bin at the final histogram. Computing histograms in CPU is trivial and easy to implement, however in GPUs the calculation is not straightforward [23]. The problem is that multiple threads may simultaneously access the same histogram bin for +1 incrementing, which causes a conflict among the threads, as one can see in Figure 2.

Figure 2: Parallel computation of a histogram with BINS bins and N threads. Histogram updates conflict and requires synchronization of the threads or atomic updates to the histogram memory.

In order to solve this issue, atomic increments are used to force the algorithm to stream the increments. Atomic functions are used to solve all kinds of synchronization and coordination problems in parallel computer systems. The general concept is to provide a mechanism for a thread to update a memory location so that the update appears to happen atomically (without interruption) respecting the other threads [8]. The so called atomicAdd(addr,y), in CUDA C, generates an atomic sequence of operations that reads the value at address addr, adds y to that value, and stores the result back to the memory address addr.

Note that the elimination of meaningless frames is done before the extraction of features, since it requires the computation of the RGB feature vectors and the removed frames do not need to be further processed. For computing the standard deviation of each feature vector (color histogram), the normalized histogram is computed. In most applications, the range of value adopted is [0,1]. The process of normalization is performed by dividing the original histogram by the number of image pixels. The normalized histogram has the same number of colors of the original histogram, the only difference between them is in the values of each colors. This normalization is done in GPUs, in which each thread normalizes one bin of the color histogram. Then, the square of each bin is computed and the result is stored in a new vector. Finally, this vector is summed up using atomic increments and then the square root is calculated.

Frame clustering

The clustering algorithm used in this step is the fc-means algorithm. This algorithm can be divided into five main steps: 1) choose the number of clusters, k; 2) divide the objects (frames represented by their feature vectors) in random groups; 3) compute the centroids of each cluster; 4) compute the distance between each object and the centroids and link the object the group whose distance to the centroid is minimal; 5) repeat steps (3) and (4) until convergence.

The GPUs implementation focuses on finding the nearest cluster for all points by computing the distance between each point and the centroids, i.e., step 4 of the algorithm [15]. This step is the most time consuming of the fc-means steps that can be parallelized. To do that, the clusters are divided into the CUDA blocks of threads. Each thread computes the nearest cluster for each object, in this case the objects are the feature vectors. After finding the nearest centroid, the object membership changes and all threads within the blocks are synchronized, to count the number of objects that moved to other cluster. This count is performed because when the objects stop moving to other clusters, the algorithm converged. After that, the information of the new clusters that each object belongs is copied to the CPU and the new centroids are computed for all clusters. This two steps are repeated until convergence.

3.2 Parallel multicore CPUs Version

Starting in 2004, the microprocessor industry has shifted to multicore scaling, increasing the number of cores, as its principal strategy for continuing efficiency growth [10]. The multiprocessor CPUs are focused on task parallelism and reducing latency. The actual multicore CPUs allow the programmers to implement efficient parallel algorithms in CPU. Each core can perform one different task, all at the same time. In this case, there is no restrictive memory access like in GPUs. Multicore CPUs are not as good as GPUs when large amounts of simple calculations are done in parallel, but when the task is more complex, or when it involves other libraries, multicore CPUs can reach a better performance. The implementation details of the three first and main steps of VSUMM approach using multicore CPUs are explained below. Note that the number of threads chosen for multicore CPUs implementation is the number of cores of the computer used, so no core is left idle.

Video segmentation

The main contribution of this version is the video segmentation. The OpenCV library is used for decoding the videos. As already stated, GPUs threads cannot execute functions of other libraries, and multicore CPUs can. To parallelize this step of VSUMM with multicore CPUs, each thread has the task to decode one part of the video, and one thread is started for each core of the CPU. To define which part of the video is decoded for each thread, the total number of frames is divided by the number of threads and doing so each thread knows the indices of the frames that it has do decode. The OpenCV library allows accessing specific frame number and then decoding it.

Feature extraction and elimination of meaningless frames

Calculating the color histogram and converting the image from the RGB color space to the HSV one, using multicore CPUs consists in dividing the frames between the threads and each one do the computations of a portion of the frames. The idea is similar to the one presented in GPUs version, but the difference is that GPUs have many threads that can do it at the same time and CPU has only a few threads. The elimination of meaningless frames is done with the main thread, since the calculations are simple and small and multicore CPUs have a lot of overhead. Note that this computation is faster using only the main thread.

Frame clustering

As reported earlier, the fc-means algorithm is used in this stage of VSUMM and it can be divided into five main steps. The multicore CPUs implementation parallels steps 3 and 4.

The implementation uses OpenMP, a collection of compiler directives, library routines, and environment variables for shared-memory parallelism in C, C++ and Fortran programs [19]. OpenMP parallels the iterations of fc-means until convergence using the directives #pragma omp parallel and #pragma omp for. The for command that iterates over all objects is then able to: 1) find the nearest cluster and then assigning the membership of the object (step

3); and 2) averaging the objects within each new cluster to calculate the new centroids (step

4). Note that all these computations are done in parallel by a number of threads specified to OpenMP.

3.3 Hybrid multicore CPUs & GPUs Version

Having a highly parallel, high throughput computation resource like the GPUs easily available and accessible in addition to the modern multicore CPUs allow us the best of both worlds to be realized [20]. The main idea of this version is to merge the best parts of the multicore CPUs version with the best parts of the GPUs version.

As said before, the video segmentation used in GPUs version is the same as the CPU sequential version, due to the lack of a fast algorithm to decode video frames only in GPUs. Nevertheless, in parallel CPU version, this step was accelerated with multicore CPUs and this best version is used here. On the other hand, the feature extraction involves simple math computation for a large set of frames and GPUs threads are good for this kind of task. While some multicore CPUs execute this computation, many GPUs threads do it simultaneously in a fast way. Finally, the fc-means clusterization done in GPUs is also chosen for this version because it achieves some best results (faster runtime) than the multicore CPUs version.

Figure 3 shows which computer architecture is used to implement each VSUMM step for the three parallel approaches introduced in this work. Notice that VSUMM steps 4 and 5 are not parallelized in this work. The hybrid version, as said before, contains the best/fastest implementations between multicore CPUs and GPUs versions.


Hybrid -

Multicore CPUs

Multicore CPUs

Multicore CPUs



Figure 3: Parallel versions and architectures used for each VSUMM step.

4 Experiments and Results

For measuring the efficiency of our proposed parallel implementations, experiments were performed using an Intel Core i7 CPU computer, with 16GB RAM, and a NVIDIA GeForce GTX 650, which contains 384 CUDA cores, on Windows 8 (64-bits). For the experiments, we created a video database using YouTube videos with different genre such as, music, sports, cartoons,

Figure 4: Percentual run time of step for the implementations: (a) CPU; (b) Multicore CPUs; (c) GPUs; and (d) Hybrid. Steps are represented by colors: ciano (segmentation), magenta (feature extraction), yellow (clustering), and red (others). The video runtimes are ascendent sorted by their time length and separated by black lines.

etc. Three different video resolutions were chosen (1920 x 1080, 640 x 360, and 320 x 240 pixels) and six different video lengths (1, 3, 5,10,20, and 30 minutes). For each combination of resolution and length were chosen ten videos, adding up 180 videos. The entire database and the parallel implementations are publicly available in [2].

Figure 4 shows the video run time percentage used for each VSUMM steps: video segmentation, feature extraction, frames clustering, and other computations. As one can observe, the time spent for each step of VSUMM depends on the video, changing a lot for CPU and mul-ticore CPUs versions. On the other hand, it is noticeable that the video segmentation (video decoding) is the most expensive task for GPUs and Hybrid versions, being the bottleneck of our parallelism.

Figure 5 shows the average overall runtime for all versions tested in this work, i.e., for each video configuration. For 320 x 240 and 640 x 360 video resolutions, the Hybrid and GPUs versions remained stable for all video length with small standard deviation, and achieving respectively the fastest and second fastest results. The multicore CPUs version also remained


1000 r

TS 800 c


10 20 Video Length (min)

"Ö 800

10 20 Video Length (min)

10 20 Video Length (min)

Figure 5: Average overall run time for different video resolutions: (a) 320 x 240; (b) 640 x 360; (c) 1920 x 1080. Vertical bars stand for standard deviation.

Table 1: Effectiveness in Frames per second (FPS) of the implemented approaches

Versions FPS reached

Max Min Average

CPU Sequential 17.815 2.183 5.726

multicore CPUs 69.236 2.191 19.653

GPUs 67.227 4.052 17.924

Hybrid 99.149 6.433 35.395

stable, with results very closed to those of the GPUs version. All versions showed good results for small video durations (1, 3 and 5 minutes), but when the video length increases, the run time also increases in the same proportion of the resolution. Only for larger video durations (20 and 30 minutes), the Hybrid version performed marginally better. Anyway, for 320 x 240 and 640 x 360 video resolutions, the run time for Hybrid, GPUs, and multicore CPUs versions are very close.

In contrast, for 1920 x 1080 video resolution (the largest video resolution used in our experiments) the run time increased proportionally and did not remain stable for any version (the standard deviation increased). The Hybrid and GPUs versions are impaired since a great amount of data had to be copied from CPU to GPUs during feature extraction step. This copying process occurred many times due to the small size capacity of GPUs memory, we can clearly see this for videos of 5,10, 20 and 30 minutes length. Comparing the results, we can see that the Hybrid version achieved the best result and the sequential version the worst one. It is possible to see that multicore CPUs results for the total execution time are better than GPUs ones, this happened in all video sizes. The reason of this is that the great amount of data that had to be copied from CPU to GPU in the GPUs version and that the multicore CPUs version speed up the video decoding step, which is the bottleneck of GPUs version.

Table 1 shows the maximum, minimum and average frames per second (FPS) processed by each version of VSUMM. As one can observe and already stated in Figure 5, again the Hybrid version is the best one, achieving a maximum rate of almost 100 FPS. From the value in Table 1, we can figure out speed up values having as reference the CPU Sequential FPS.

The total average speed up for the Hybrid version is about 7x; for the GPUs and the multicore CPUs versions the average speed up is around 4x . The Hybrid version reached the maximum speed up of 20 x for 30-minutes video of 240 x 320 resolution and the minimum of 2.5x for 20-minutes video of 1080 x 1920 resolution.

It is important to note that for each video configuration (resolution and length), the speed up is different, because, as said before, the run time percentage of each VSUMM step differs a lot from video to video. In addition, the speed up for each step is different, 3.6x for multicore CPUs video segmentation; 15.6x for GPUs feature extraction and 4.7x for multicore CPUs feature extraction. These speed up have a small standard deviation between the different video configurations. However, the fc-means parallel implementation in GPUs and multicore CPUs proved to be very unstable, with speed up changing a lot from one video length to another. The random initialization of k-means algorithm may cause this fact.

5 Conclusion

In this paper, three parallel implementations for a video summarization approach called VSUMM [9] were introduced. The first one takes advantages of the graphics processing units, which can simultaneously have thousands of threads doing computations. The second one used multicore CPUs to execute different tasks simultaneously. Finally, the last one combines the best parts of the previous versions. The results showed that the hybrid version, mixing GPUs and multicore CPUs reached the best results for all resolutions and video lengths. Moreover, this implementation presented the most stable runtime (small standard deviation) with increasing of video length. For some videos, the speed up is about 20x and the hybrid implementation achieved a maximum processing rate of almost 100 frames per second.


The authors would like to thank to CAPES, CNPq, FAPEMIG and FAPESP for providing funds to support this project and also to anonymous reviewers for their valuable comments to increase the readability of this manuscript.


[1] Jurandy Almeida, Neucimar J. Leite, and Ricardo da S. Torres. VISON: Video summarization for ONline applications. Pattern Recognition Letters, 33(4):397-409, 2012.

[2] Suellen Silva Almeida. Parallels implementation for VSUMM and databases. https://github. com/susilvaalmeida/vsumm, 2013.

[3] E.J.Y.C. Cahuina and G. Camara-Chavez. A new method for static video summarization using local descriptors and video temporal segmentation. In Conference on Graphics, Patterns and Images (SIBGRAPI), pages 226-233, 2013.

[4] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, and Kevin Skadron. A performance study of general-purpose applications on graphics processors using {CUDA}. Journal of Parallel and Distributed Computing, 68(10):1370 - 1380, 2008. General-Purpose Processing using Graphics Processing Units.

[5] Ngai-Man Cheung, Xiaopeng Fan, O.C. Au, and Man-Cheung Kung. Video coding on multicore graphics processors. IEEE Signal Processing Magazine, 27(2):79-89, 2010.

[6] Gianluigi Ciocca and Raimondo Schettini. Erratum to: An innovative algorithm for key frame extraction in video summarization. Journal of Rea,l-Tim,e Image Processing, 8(2):225-225, 2013.

[7] J. Clemons, A. Jones, R. Perricone, S. Savarese, and T. Austin. Effex: An embedded processor for computer vision based feature extraction. In ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1020-1025, 2011.

[8] NVIDIA Corporation. NVIDIA CUDA C Programming Guide. NVIDIA, 2012.

[9] Sandra Eliza Fontes de Avila, Ana Paula Brando Lopes, Antonio da Luz Jr., and Arnaldo de Albuquerque Araujo. Vsumm: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters, 32(1):56-68, 2011. Image Processing, Computer Vision and Pattern Recognition in Latin America.

[10] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. Power challenges may end the multicore era. ACM Communications, 56(2):93-102, 2013.

[11] R.H. Evangelio, T. Senst, I. Keller, and T. Sikora. Video indexing and summarization as a tool for privacy protection. In International Conference on Digital Signal Processing (DSP), pages 1-6, 2013.

[12] M. Furini, F. Geraci, M. Montangero, and M. Pellegrini. On using clustering algorithms to produce video abstracts for the web scenario. In IEEE Consumer Communications and Networking Conference (CCNC), pages 1112-1116, 2008.

[13] Petr Holub, Martin ?rom, Martin Pulec, Ji? Matela, and Martin Jirman. Gpu-accelerated DXT and JPEG compression schemes for low-latency network transmissions of hd, 2k, and 4k video. Future Generation Computer Systems, 29(8):1991-2006, 2013.

[14] Ping Li, Hanqiu Sun, Chen Huang, Jianbing Shen, and Yongwei Nie. Interactive image/video retexturing using GPU parallelism. Computers & Graphics, 36(8):1048-1059, 2012. Graphics Interaction Virtual Environments and Applications (2012).

[15] W.-K. Liao. Parallel k-means data clustering. Kmeans/index.html, 2005.

[16] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297. University of California Press, 1967.

[17] NVIDIA. NVIDIA CUDA Video Decoder. index.html, 2010.

[18] NVIDIA. CUDA Parallel Computing Platform., 2012.

[19] OpenMP Architecture Review Board. OpenMP application program interface version 4.0, 2013.

[20] J. Palacios and J. Triska. A comparison of modern gpu and cpu architectures: And the common

convergence of both. Technical report, Oregon State University, 2011.

[21] Silvia Pfeiffer, Rainer Lienhart, Stephan Fischer, and Wolfgang Effelsberg. Abstracting digital movies automatically. Journal of Visual Communication and Image Representation, 7(4):345-353, 1996.

[22] Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, and Wen-mei W. Hwu. Optimization principles and application performance evaluation of a multithreaded gpu using cuda. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 73-82, 2008.

[23] Jason Sanders and Edward Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 1st edition, 2010.

[24] Jianbo Shi and Carlo Tomasi. Good features to track. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 593-600, 1994.

[25] Sudipta N. Sinha, Jan-Michael Frahm, Marc Pollefeys, and Yakup Genc. Feature tracking and matching in video using programmable graphics hardware. Machine Vision and Applications, 22(1):207-217, 2011.

[26] Changchang Wu, S. Agarwal, B. Curless, and S.M. Seitz. Multicore bundle adjustment. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3057-3064, 2011.