
Transportation Research Procedia 10 (2015) 276 - 285



18th Euro Working Group on Transportation, EWGT 2015, 14-16 July 2015, Delft, The Netherlands

Analyzing traffic patterns on street segments based on GPS data using R

Emilian Necula*

Faculty of Computer Science, University Al. I. Cuza, General Berthelot, 16, Iasi 700483, Romania


Abstract

GPS-enabled devices are now widespread among drivers, which makes the collection of GPS data more accessible and creates an opportunity to infer useful patterns and trends. In this research we apply a statistical approach to 10000 vehicle GPS traces from around 3600 drivers; the traces are mined to extract outlier traffic patterns for further use in an Intelligent Transportation System. We divide the urban area into a grid and organize the road infrastructure as segments in a graph. At a given time we can then make an assumption about the congestion level in a specific area by taking into account the visits of each vehicle recorded in the GPS trace data. Over time, the visited segments settle into a pattern and vary periodically. In this study we use the R software in conjunction with a set of libraries, which provide an environment for statistical analysis and for producing graphics that annotate the results. Our objective is to identify contiguous sets of road segments and time intervals that have the largest statistically significant relevance in forming traffic patterns. Taking into account the number of drivers who submitted their routes relative to the entire population of New Haven, we can state that a 2-3% penetration rate of smart phones is enough to provide accurate measurements of the traffic flow and to identify traffic patterns.

© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license


Peer-review under responsibility of Delft University of Technology

Keywords: GPS patterns; traffic flow; car mobility; R-language; congestion

* Presenting author E-mail address:

ISSN 2352-1465

doi: 10.1016/j.trpro.2015.09.077

1. Introduction

Accurate measurement and analysis of traffic patterns has broad benefits and applications. It can be used directly in driver information systems, and it can also be part of larger traffic-related systems, such as traffic management systems or personal car navigation. The input for a system that tries to identify movement patterns can also vary. It can consist of more traditional data originating from stationary in-road sensors (such as loop detectors), or it may be moving car data generated by vehicles equipped with a GPS-enabled device. The latter type of data is especially interesting for several reasons. First, it is quickly becoming more widespread and available (even in real time) with the rising popularity of online personal car navigation and vehicle monitoring services. Second, moving car data has the potential to provide exhaustive coverage of the whole road network and, as a result, can yield a better traffic pattern estimation (or, more generally, traffic forecasting) model. On the other hand, this kind of data can be very irregularly distributed in time and space (or even missing most of the time in certain areas), so building traffic models can be difficult. Our goal is to determine groups of street segments that possess a similar traffic load over time. In a second step we interpret these patterns considering their temporal as well as spatial characteristics. We use available GPS traffic traces and apply statistical, data-driven computational methods. An advantage of our solution is its flexibility in incorporating additional explanatory variables. Necula (2014) also used this method in the implementation of a mesoscopic traffic simulator, which is more adequate than the traditional speed-density simulators. While these general methods and tools already exist, their application to this specific problem and their integration into the proposed framework for traffic flow estimation are new.

The methodology is applied to a data set from New Haven County, Connecticut, USA. Within the overall analysis process, data mining is viewed as the sub-process concerned with the discovery of hidden information. We apply a clustering algorithm to identify groups of street segments that possess a similar traffic distribution over the day and the week, in order to extract knowledge that will facilitate traffic flow analysis. Subsequently, we interpret the resulting clusters according to their temporal pattern as well as their geographic location. To aid the exploration of the traffic data we used the R software to represent various states of the traffic flow.

The paper is organized as follows. Section 2 discusses related work. Section 3 describes the GPS data sampling and acquisition and then introduces the analysis process, which comprises data processing, clustering, and the temporal and spatial interpretation of the traffic clusters needed to make a relevant decision about the occurrence of congestion. We discuss our results in Section 4 and conclude the paper with a summary and an outlook on future work.

2. Related work

The analysis of GPS trace data is a very active research area. It has produced a number of algorithms for the clustering of trajectories, presented by Lee et al. (2007), Piciarelli et al. (2008) and Rinzivillo et al. (2008), as well as various definitions of distance functions between traces, introduced by Pelekis et al. (2011). Our approach differs from these works in that we do not concentrate on traces as the principal objects of interest. Instead, we evaluate the frequency and temporal distribution of people passing a specified set of locations in an urban area, in our case street segments. Such information is often collected by traffic management centers directly at the level of street segments, but we do not have access to such databases. Fitschen and Nordmann (2010), using traces collected by the German traffic authority, analyze and cluster time variation curves using a daily, weekly or yearly granularity of time. A very detailed analysis of traffic distributions was conducted by Weijermars (2007). In addition to weekly patterns, seasonal variations and variations in weather factors were analyzed. However, the basis for this analysis was data from induction loops (continuous observations of vehicular traffic for a selection of major road segments). An extensive analysis of GPS data using methods from visual analytics was conducted by Andrienko (2008). The author formed spatial aggregation units based on a grid and provided visualization techniques for the analysis of temporal variation in speed and direction. The variations were visualized using geographically aligned mosaic diagrams, which indicate the intensity of the analyzed parameters for each day of the week and hour of the day. However, the exploration of grid cells with similar usage patterns was left to the visual capabilities of the user. In contrast, our data set contains usage patterns for several thousand street segments, which cannot be analyzed by visual inspection alone.

Recently, several more attempts have been made, by Shen and Ma (2008) and by Shaw and Yu (2009), to use GIS for the measurement and analysis of individual mobility under special space-time constraints. Space-time sequence analysis, introduced by Shoval and Isaacson (2007) and Shoval (2008), accounts for the fact that data points (drivers' locations) taken over space and time may have an internal structure such as autocorrelation, trend or seasonal variation. Although we rarely perceive any of our actions to be random, from the perspective of an outside observer who is unaware of our motivations and schedule, our traffic pattern can easily appear random and unpredictable. However, Gonzalez et al. (2008) found that drivers' trajectories show a high degree of temporal and spatial regularity, each individual being characterized by a time-independent characteristic travel distance and a significant probability of returning to a few highly frequented locations. Similarly, by measuring the entropy of each individual's trajectory, Song et al. (2010) demonstrated a 93% potential predictability in user mobility across the whole user base. These studies indicate that, despite the diversity of their travel histories, drivers follow simple reproducible patterns. This inherent similarity in travel patterns could affect all phenomena driven by human mobility, from congestion prevention to urban planning and agent-based modelling.

Traffic state analysis is a key problem with considerable implications for modern traffic management. Several modeling approaches have been used, including Kalman filters (Wang et al., 2006a,b; Liu et al., 2006), neural networks (van Lint, 2008; Vlahogianni et al., 2008; Dunne and Ghosh, 2012) and others (Stathopoulos and Karlaftis, 2003; El Faouzi et al., 2009). Karlaftis and Vlahogianni (2011) compare statistical methods and neural networks in transportation research, highlighting some of the differences between the two types of data analysis tools. Various techniques have been used to estimate multi-regime traffic models. Sun and Zhou (2005) use cluster analysis to segment speed-density data and determine the regime boundaries for typical (two-regime and three-regime) speed-density models.

Clustering and classification are popular techniques with many related studies. El Faouzi (2004) presents a data-driven approach that aggregates multiple estimators, attempting to aggregate all the information which each estimation model embodies (some of which might be lost if only the "best" model was chosen and applied). El-Faouzi and Lefevre (2006) use two different approaches from evidence theory (classifier fusion and distance-based classification) for clustering and classification for road travel time estimation. Azimi and Zhang (2010) apply three different unsupervised learning methods (K-Means, Fuzzy C-Means and CLARA) to classify freeway traffic flow conditions based on the characteristics of the flow. Antoniou and Koutsopoulos (2006a) present a framework for the estimation of speeds using machine-learning approaches.

3. Proposed analysis

3.1. Data acquisition

In this research we considered the urban area of the city of New Haven, Connecticut, USA. New Haven is the second-largest city in Connecticut (after Bridgeport), with a population of 130,741 people and an area of 52.1 km2. It has a long tradition of urban planning and a purposeful design for the city's layout; it could be argued to have one of the first preconceived layouts in the country. For the purpose of the experiment we needed a large GPS traffic database, so we used the route traces from the MapMyRun service (MapMyRun, 2014). A website that manages each user's traces and displays useful information is available (Fig. 1a). This application collects route traces from a large number of users who have installed the MapMyRun mobile application on their smart phones (Android, iOS or Windows based) and have previously submitted their daily routes to the system. In New Haven almost 4000 users are registered as active users of MapMyRun (around 2-3% of the entire population).

The GPS dataset was collected using a script written in R. R is an open source software environment for statistical computing and graphics, with large extensibility through user-created packages available from the Comprehensive R Archive Network (CRAN) (R Core Team, 2014). There is a vast collection of R packages and functions made specifically for data mining (Cran Packages, 2014). These include packages for clustering (fpc, cluster, pvclust, mclust), which contain common clustering algorithms such as k-means, hierarchical clustering and DBSCAN, as well as packages for plotting cluster solutions. There are also classification packages (rpart, party, tree, etc.), which contain decision tree, regression and survival analysis algorithms, while the association rules and frequent itemsets packages (arules, drm) contain algorithms for finding frequent itemsets and association rules (e.g. the APRIORI algorithm). Other available packages cover sequential patterns (arulesSequences), time series, statistics and graphics, as well as data manipulation (httr, XML, ggmap) and an interface to the WEKA mining tool. Several graphical user interfaces (GUIs) exist for R; we used RGui for collecting and analyzing the GPS data traces. The workflow of data mining with R typically comprises importing the required packages, importing data, transforming the data into a convenient format for analysis, applying the data mining or statistical functions, and visualizing, validating and exporting the results.

Fig. 1. (a) Web interface for MapMyRun route traces service; (b) Part of the html page with the 10000 route traces from New Haven, US.

For this study, the 10000 GPS traces were extracted using an R script. An accessible method was to obtain the webpage listing the corresponding routes by triggering a request on the MapMyRun server using a resource link. Part of the webpage with the resulting GPS traces is displayed in Fig. 1b. Afterwards we saved the webpage into a text file called "10000MapMyRunRoutes.txt" and used it for the next collection step.

Next we searched the text file for route IDs in order to extract the GPS traces individually from the server. We used a regular expression to identify all the route IDs and stored them in "IDRuns". For data collection and manipulation, R allows vectorized operations on array-based objects, which significantly simplifies their manipulation because no explicit loops over array dimensions are required. Using the httr library (Cran Packages, 2014) we were able to construct the final GPX download links and issue GET requests to the MapMyRun server. To complete the process and call the GET method we had to generate an authorization key based on an API key obtained after signing in as developers (Under Armour, 2014). The code snippet from the script is shown in Fig. 2.


R script (left panel):

    library(httr)

    setwd('C:/Users/...')
    Runs <- readLines(...)
    IDRuns <- Runs[grep('article class...', Runs)]
    IDRuns <- regmatches(IDRuns, ...)
    IDRuns <- gsub(..., IDRuns)
    IDRuns <- gsub(..., IDRuns)

    for (j in 1:length(IDRuns)) {
      route_id <- IDRuns[j]
      url <- paste("https://...", route_id, ...)
      GET(url,
          add_headers("Api-Key" = "...",
                      "X-Originating-Ip" = "...",
                      "Content-Type" = "application/json"),
          write_disk(destfile))
    }

GPX output (right panel):

    <gpx xmlns="http://www.topografix.com/GPX/1/1" ...>
      <trk>
        <name>3.39 mi drive on 9/23/2013</name>
        <trkseg>
          <trkpt lat="41.3304411" lon="-72.9197885333"/>
          <trkpt lat="41.3304491908" lon="-72.9198516349"/>
          ...
          <trkpt lat="41.3303634480" lon="-72.9190224261"/>
          <trkpt lat="41.330347034" lon="-72.9189809974"/>
        </trkseg>
      </trk>
    </gpx>

Fig. 2. The R script that downloads 10000 GPS traces from New Haven, US and obtains the GPX files.
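The ID extraction step described above can be sketched in base R. This is only an illustration under stated assumptions: the page markup, the `data-id` attribute, the pattern and the host `example.invalid` are hypothetical placeholders, since the actual listing page and resource link are not reproduced in the paper.

```r
# Hypothetical excerpt of the saved listing page (the real markup is not shown in the paper).
page_lines <- c(
  '<article class="route" data-id="301440575">',
  '<p>no route here</p>',
  '<article class="route" data-id="301440999">'
)

# Keep only the lines that mention a route, then isolate the ID attribute.
route_lines <- page_lines[grep("article class", page_lines, fixed = TRUE)]
id_attrs <- regmatches(route_lines, regexpr('data-id="[0-9]+"', route_lines))

# Strip everything that is not a digit to obtain the bare route IDs.
route_ids <- gsub("[^0-9]", "", id_attrs)

# Vectorised construction of one download link per route (placeholder host).
base_url <- "https://example.invalid/routes"
gpx_urls <- paste0(base_url, "/", route_ids, ".gpx")
print(gpx_urls)
```

Because `grep`, `regmatches`, `gsub` and `paste0` are all vectorised, no explicit loop is needed until the download requests themselves are issued.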

At the end of the acquisition step we obtained 10000 GPX traces that describe the routes made by drivers in the city of New Haven between 01.01.2012 and 31.12.2013. A GPX file is an XML document. A route or track has a name that states the total distance (in miles), the type of activity and the date. A route is stored as a sequence of track segments ("trkseg" tag), which are composed of track points ("trkpt" tag). A track point is formed by a pair of coordinates (latitude and longitude, stored as attributes). A sample GPX file (MapMyRun_301440575.gpx) that stores a route is shown on the right side of Fig. 2 as output.
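To make the GPX structure concrete, the following minimal sketch reads the track points of a small in-memory GPX string. It deliberately uses base R regular expressions instead of the XML package mentioned in the paper, so it is a dependency-free illustration and not a robust XML parser; the helper `get_attr` is a hypothetical name.

```r
# A tiny GPX document with three track points (coordinates in the style of the sample file).
gpx_text <- paste(
  '<gpx xmlns="http://www.topografix.com/GPX/1/1">',
  '<trk><name>3.39 mi drive on 9/23/2013</name><trkseg>',
  '<trkpt lat="41.3304411" lon="-72.9197885333"/>',
  '<trkpt lat="41.3304491908" lon="-72.9198516349"/>',
  '<trkpt lat="41.3303634480" lon="-72.9190224261"/>',
  '</trkseg></trk></gpx>',
  sep = "\n"
)

# Find every <trkpt ...> element in the document.
trkpts <- regmatches(gpx_text, gregexpr("<trkpt[^>]*>", gpx_text))[[1]]

# Hypothetical helper: read one numeric attribute from each tag.
get_attr <- function(tags, name) {
  pattern <- paste0('.* ', name, '="([^"]*)".*')
  as.numeric(sub(pattern, "\\1", tags))
}

track <- data.frame(lat = get_attr(trkpts, "lat"), lon = get_attr(trkpts, "lon"))
print(track)
```

For real files a proper XML parser is the safer choice, because attribute order, whitespace and nested elements such as timestamps may vary.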

The database of 10000 GPS traces occupies 1.1 GB. The trajectories cover the entire city, while the central area concentrates relatively higher trip volumes. In order to obtain representative results over one week it is important that each driver provides a high number of measurement days. For example, if a driver's records cover only Monday and Tuesday, his trajectories do not support the analysis of weekends. However, as movements have a repetitive character (Gonzalez et al., 2008; Schlich and Axhausen, 2003), it can be expected that the driver visits similar street segments on the weekend as well. Thus a small number of measurement days increases the risk that we underestimate the number of street segment visits and thereby introduce an error into our analysis. Similarly, in order to provide a high temporal resolution it is important that each street segment possesses many visits. However, introducing a lower bound for the number of valid measurement days per driver and for the number of visits per street segment reduces the size of the data set, so we have to find a trade-off between data quality and remaining data size.

With respect to the number of valid measurement days, we decided to introduce a lower bound of 5 days per driver. This leads to a reduced traffic data set of 3521 test drivers (91%) in New Haven. For the number of visits we decided on a threshold of 20 visits per street segment. After applying this threshold, 5330 segments (65%) of the New Haven road infrastructure remain. Table 1 summarizes the statistics of the data filtering process.

Table 1. Statistics of the urban area covered and the number of drivers used in the research

Data collection metrics    Initial    After filtering    % of total
Drivers using MapMyRun     3870       3521               91
Street segments            8200       5330               65
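The two filtering thresholds can be illustrated with a short base R sketch on made-up data. The function name `filter_visits`, the column names and the order of the two steps (drivers first, then segments) are assumptions for illustration; the paper does not show the filtering code.

```r
# Toy visit log: one row per visit of a driver to a segment on a given day (invented values).
visits <- data.frame(
  driver  = c(rep("d1", 6), rep("d2", 3)),
  day     = as.Date("2013-09-01") + c(0:5, 0, 0, 1),
  segment = c("s1", "s1", "s2", "s1", "s1", "s1", "s1", "s2", "s2")
)

# Hypothetical helper applying both lower bounds; defaults follow the thresholds in the text.
filter_visits <- function(visits, min_days = 5, min_visits = 20) {
  # Step 1: keep drivers with at least min_days distinct measurement days.
  days_per_driver <- tapply(visits$day, visits$driver, function(d) length(unique(d)))
  keep_drivers <- names(days_per_driver)[days_per_driver >= min_days]
  kept <- visits[visits$driver %in% keep_drivers, ]

  # Step 2: keep segments with at least min_visits visits by the remaining drivers.
  visits_per_segment <- table(kept$segment)
  keep_segments <- names(visits_per_segment)[visits_per_segment >= min_visits]
  kept[kept$segment %in% keep_segments, ]
}

# The toy data are far too small for 20 visits, so the segment bound is lowered to 2 here.
filtered <- filter_visits(visits, min_days = 5, min_visits = 2)
print(filtered)
```

In this toy example driver d2 is dropped for having only two measurement days, and segment s2 is then dropped because it retains a single visit.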

3.2. Data processing

In order to find street segments with a similar development of the traffic pattern over time, the data has to be aggregated and normalized. The level of aggregation determines which details are still visible in the data. However, a level that is too fine-grained will result in sparse data. In addition, it will decrease the power of our clustering due to a high number of attributes (Mitchell, 1997). The goal of our analysis is to find street segments with a similar traffic flow over the day as well as over the week. As traffic conditions can change quite fast over the day, we decided to use the hour as the daily aggregation unit. For the weekly dimension, an aggregation based on weekdays is recommended, as humans are known for their repetitive behavior over time. Typical mobility studies of traffic load, such as Fitschen and Nordmann (2010), show that traffic during working days differs substantially from traffic during the weekend. Saturday and Sunday also differ from each other, which is plausible due to the closure of shops on Sundays (and holidays) in most European countries. Between working days, differences exist mostly due to commuting behavior before and after the weekend. One option would therefore be to aggregate only the working days Tuesday to Thursday and keep separate records for Monday and Friday. However, as the differences are small and concern mostly the major daily traffic peak, we decided not to divide working days further. In summary, we formed the following groups of weekdays for our aggregation: 1. Monday - Friday; 2. Saturday; 3. Sunday and holidays.

Having selected the aggregation units, we can now formalize the aggregation of visits. Let $h \in \{0, 1, \dots, 23\}$ index the hours of a day, let $d \in \{1, 2, \dots, 7\}$ index the days of the week, let $g \in \{1, 2, 3\}$ index the groups of weekdays, and let $n_{i,d,h}$ denote the number of visits on street segment $i$, on day $d$, at hour $h$. The aggregation $n_{i,g,h}$ for street segment $i$ over the chosen units of time is defined in Equation 1:

$$
n_{i,g,h} =
\begin{cases}
\dfrac{1}{5} \sum_{d=1}^{5} n_{i,d,h} & \text{for } g = 1 \\
n_{i,d,h} \text{ with } d = g + 4 & \text{for } g \in \{2, 3\}
\end{cases}
\tag{1}
$$

We further normalize the aggregated visits because our analysis focuses on a similar development of the traffic flow independent of the actual height of traffic. The normalization takes place per street segment and is defined in Equation 2:

$$
n^{*}_{i,g,h} = \frac{n_{i,g,h}}{\sum_{g'=1}^{3} \sum_{h'=0}^{23} n_{i,g',h'}}
\tag{2}
$$
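The aggregation and normalization steps can be sketched for a single street segment as follows. This is a minimal illustration under two assumptions that should be checked against the published equations: the working-day group is taken as the mean over Monday to Friday, and the normalization divides each value by the segment's total over all groups and hours. The function names and the random toy counts are hypothetical.

```r
# Toy counts n[d, h] for one street segment: 7 days (row 1 = Monday) x 24 hours.
set.seed(1)
n_dh <- matrix(rpois(7 * 24, lambda = 3), nrow = 7, ncol = 24)

# Aggregate days into the three weekday groups (assumed: mean over working days).
aggregate_segment <- function(n_dh) {
  stopifnot(all(dim(n_dh) == c(7, 24)))
  rbind(
    colMeans(n_dh[1:5, , drop = FALSE]),  # g = 1: Monday to Friday
    n_dh[6, ],                            # g = 2: Saturday (d = g + 4)
    n_dh[7, ]                             # g = 3: Sunday   (d = g + 4)
  )
}

# Normalise per segment (assumed: divide by the segment total).
normalise_segment <- function(n_gh) {
  total <- sum(n_gh)
  if (total == 0) return(n_gh)  # a segment without visits stays all zero
  n_gh / total
}

n_gh   <- aggregate_segment(n_dh)
n_star <- normalise_segment(n_gh)

# Flatten to one feature vector: 24 hours of g = 1, then g = 2, then g = 3.
features <- as.vector(t(n_star))
print(length(features))
```

The resulting vector has 3 x 24 = 72 entries per segment, which is the attribute count used in the clustering step.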

3.3. Clustering

We applied clustering, an unsupervised learning method, in order to find groups of street segments with a similar course of traffic flow over time. Each street segment is described by a set of 72 attributes, each of which contains the normalized number of visits in the respective aggregation unit (3 groups of weekdays, 24 hours per weekday group). We tested three different types of clustering algorithms, namely hierarchical clustering, density-based clustering with DBSCAN (Ester et al., 1996) and partitioning clustering (k-means), using the WEKA package for R, RWeka (Hall et al., 2009).

We used this intuitive R plug-in to display the clustering diagrams and plot them directly on an OpenStreetMap map layer. We obtained the best results with k-means clustering. Hierarchical clustering posed the problem that the criterion for the final selection of the number of clusters was ambiguous. With DBSCAN the parameter selection turned out to be difficult: we obtained either one single cluster or a very high number of small clusters without clear interpretation. For k-means we varied the number of clusters between 2 and 8 and selected k = 4 clusters for New Haven. Fig. 3a shows the obtained curves of fit for each number of clusters. "Final" refers to the fit obtained by the best k-means model (in terms of volume, shape and orientation of the clusters), while "Reference" refers to the values obtained using DBSCAN (equal volume, shape and orientation for all clusters). The "optimal" model has 4 clusters. However, it is recognized that determining the number of clusters solely on the basis of a goodness-of-fit measure is likely to favor larger numbers of clusters, which might also be more difficult to interpret from a traffic flow theory point of view. Furthermore, considering the incremental benefits of additional clusters (which result in more complicated traffic descriptions), one notices that a smaller number of clusters appears to give a fit close to the optimal one. Table 2 shows the resulting final cluster sizes.

Table 2. Size of the resulting clusters

Cluster ID    No. of segments    Cluster color
1             865                Blue
2             1982               Green
3             127                Pink
4             2356               Yellow
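The scan over the number of clusters can be sketched with the `kmeans` function from R's standard `stats` package. This is a stand-in for the RWeka-based workflow used in the paper, and the data are synthetic: 200 artificial segments with 72 attributes, generated from four noisy prototype profiles. The total within-cluster sum of squares is used here as the fit measure, which is an assumption; the paper does not state which measure underlies Fig. 3a.

```r
set.seed(42)

# Synthetic stand-in for the segment-by-attribute matrix (200 segments x 72 attributes).
prototypes  <- matrix(runif(4 * 72), nrow = 4)
labels_true <- sample(1:4, 200, replace = TRUE)
x <- prototypes[labels_true, ] + matrix(rnorm(200 * 72, sd = 0.05), nrow = 200)

# Fit k-means for k = 2..8 and record the total within-cluster sum of squares.
ks  <- 2:8
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
names(wss) <- ks
print(round(wss, 2))

# Final model with k = 4, the value selected in the text, and its cluster sizes.
fit <- kmeans(x, centers = 4, nstart = 10)
print(table(fit$cluster))
```

Plotting `wss` against `ks` gives the usual elbow-style curve; the fit keeps improving as k grows, which is why the choice of k also has to weigh interpretability.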

3.4. Temporal Clusters Analysis

Fig. 3b shows the cluster means developing over time. Remember that the first third of the x-axis corresponds to the average traffic distribution on working days; the second third corresponds to the distribution on Saturdays and the last third to the distribution on Sundays and holidays.


Fig. 3. (a) Determining optimal number of clusters; (b) Traffic flow clusters variation over time using k-means clustering.

New Haven displays typical movement patterns on working days, which are also very similar across the different clusters. The working days are characterized by three peaks, which occur at index hours 7, 13-14 and 17. These peaks clearly stand for the travel to and from work in the early morning and late afternoon, as well as for movement during the lunch break or the return trip of part-time workers. The clusters that contain the majority of segments also possess a high inner-cluster variation; their smoother shape thus results in part from averaging over large numbers. For New Haven, activity on Saturday is shown by Clusters 1, 2 and 3, although with a shift in their peaks. Cluster 2 shows the highest peak at index hour 10 and remains at a raised level of activity until the early evening. This activity corresponds to the opening hours of shops in New Haven (9:00 to 21:00). Cluster 3 has its peak on Saturday at index hour 12. Afterwards the activity decreases and has two smaller peaks, in the early evening and in the early morning hours of Sunday. The characteristic peak of Cluster 3 lies on Sunday at index hour 13 and is accompanied by a raised level of activity between index hours 10 and 18. The time periods on Saturday as well as Sunday correspond to hours of leisure activities, marking especially the times for lunch and dinner. Finally, Cluster 4 shows high activity during Saturday afternoon as well as Sunday morning and afternoon, which also indicates leisure activities.
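Peak hours such as those discussed above can be read off a cluster's mean profile programmatically. The sketch below uses an invented 24-hour working-day profile, not values from the study, and a simple local-maximum rule chosen for illustration.

```r
# Hypothetical working-day mean profile of one cluster (24 hourly values, invented).
profile <- c(1, 1, 1, 1, 2, 4, 7, 9, 6, 5, 5, 6, 7, 8, 8, 6, 7, 9, 6, 4, 3, 2, 1, 1)
hours   <- 0:23

# An hour counts as a peak if activity rose into it and does not rise afterwards.
rises_into  <- c(FALSE, diff(profile) > 0)
falls_after <- c(diff(profile) <= 0, FALSE)
peak_hours  <- hours[rises_into & falls_after]
print(peak_hours)
```

For this made-up profile the rule reports hours 7, 13 and 17; a plateau such as hours 13-14 is reported at its first hour only.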

3.5. Spatial Clusters Analysis

In this section we consider the spatial distribution of the street segment clusters in order to gain a better understanding of the road usage clusters and to establish a connection between the traffic patterns. Fig. 4a shows an OpenStreetMap layer of New Haven with street segments colored according to their cluster ID. Cluster 4, which covers the majority of street segments, does not allow us to draw specific, location-dependent conclusions about the usage behavior and traffic pattern. However, Clusters 1, 2 and 3 allowed for further interpretation by visual analysis. Fig. 4b shows details for Cluster 1. Its segments are located along the North Atlantic Ocean. The area marked A contains sports facilities, which are most likely used for sport events during the weekend. The segment leading into the area marked B gives access to an open-air swimming pool, which also has access to the ocean. Finally, area C contains the harbor and promenade, including a landing place for ferries and boats and a metro/subway station. All of these areas are intended for leisure and recreational activities and confirm the temporal patterns described in the previous section.

Fig. 4. (a) Clustering New Haven - left side; (b) Some representative areas from Cluster 1

The visual analysis of Cluster 2, which showed a characteristic peak on Saturday morning as well as high activity during the opening hours of shops, revealed that the cluster contains in part collections of street segments in the inner city of New Haven as well as in its surrounding suburbs. However, segments are also located within residential areas and on access roads. Thus, the segments are characterized not only by shopping activities, but also by access functionality and home-based activities. The last two reflect in their temporal behaviour the departure and arrival times of people. Finally, the segments belonging to Cluster 3 are distributed over New Haven and primarily have an access function to residential areas or major roads. Thus, the cluster represents especially movement to and from residential areas, which is plausible considering the main peaks around lunch and dinner time as well as the peaks on late Saturday evening and early Sunday morning. We suppose that these characteristics show mainly on access roads rather than directly within the residential areas because these streets are likely to receive more visits and thus yield more stable results.

4. Experimental Results

The experimental part of this research was also performed within the RGui software, using the mclust package (Cran Packages, 2014). It is an R package for normal mixture modeling via model-based clustering, classification and density estimation. Using different parameters, we tried to emphasize the traffic patterns for the New Haven urban area. Fig. 5a provides a visual representation of the 4 clusters described in Section 3.3, taking into account the traffic flow for those road segments.


Fig. 5. (a) Traffic patterns clustering based on the traffic flow; (b) Introducing speed into the clustering. Panel title: New Haven, CT, USA - 4 clusters; axes: speed (km/h), flow (veh/h), density (veh/km/lane).


At first glance we can observe that, as the number of vehicles visiting a road segment rises, the density level for that segment also rises. We expected these values for most of the road segments included in this research, but there are some cases where the flow remains stable while the density reaches higher values. These situations might be classified as being predisposed to traffic congestion. Several observations can be based on this figure. For example, restrictions on the volume, shape and orientation of the clusters limit the ability of the clusters to adequately reflect the shape of the data. As a result of these restrictions, many clusters cannot be formed in a way that is consistent with traffic flow theory. For example, the higher-density region of the data is not captured at all, and the clusters that have been created cannot be behaviorally explained. We have relaxed these restrictions, allowing the clusters to be formed in ways that are more meaningful and consistent with traffic flow theory.

In Fig. 5b we plotted the same 4 clusters but introduced the speed values for the drivers visiting the monitored road segments. As we can see, the road segments from the recreational areas of New Haven (blue cluster) exhibit greater speeds and small values for the traffic flow. With almost the same traffic flow values, the road segments located in the shopping area or the centre of the city (green cluster) show speeds of up to 65 km/h. The quality of the clustering and the realism of the traffic pattern analysis can be validated using these scatter plots. Another good example is the speed-density diagram, where we observe that speed is inversely related to density: greater density implies lower speed values.
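The inverse relation between speed and density is consistent with the fundamental identity of traffic flow theory, which links the three quantities on the axes of Fig. 5. As a brief reminder, with $q$ the flow per lane (veh/h), $k$ the density (veh/km/lane) and $v$ the mean speed (km/h), where the symbols are chosen here for illustration and do not appear in the original text:

$$
q = k \cdot v \quad \Longrightarrow \quad v = \frac{q}{k}
$$

For two segments with almost the same flow $q$, the segment with the higher density $k$ must therefore show the lower speed $v$. For instance, under this identity a flow of 1200 veh/h per lane corresponds to 60 km/h at 20 veh/km/lane but only 30 km/h at 40 veh/km/lane.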

Fig. 6. (a) Congestion level variation at different timestamps; (b) Traffic flow plotted directly on the road map segments for Yale region

To better understand the dynamics of the traffic we computed the traffic congestion index (TCI). The TCI, introduced by Schrank and Lomax (2005), is a measure of the daily vehicle miles traveled (DVMT) per road segment-mile of freeways and principal arterial streets. It is an empirically derived formula to quantify the relative congestion levels in urban areas. We found it suitable for our research because the index allows for comparison across metropolitan areas by measuring the full range of traffic network performance. The main focus is on the physical capacity of the roadway in terms of vehicles. As shown in Fig. 6a, the road network displays a severe congestion level between 7:45 and 8:45 and between 17:15 and 18:45 on weekdays. The top congestion levels appear in the peak periods, when the maximum congestion index value is more than 8.0. Analyzing the same month in consecutive years, we can conclude that the times of traffic congestion have not changed much. We can use this information for future predictions regarding traffic flow problems.
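Given a series of congestion index values, the congested time windows can be extracted with a run-length encoding. The sketch below uses invented quarter-hour readings and the threshold of 8.0 mentioned in the text; it does not compute the TCI itself, whose formula is not reproduced in the paper.

```r
# Hypothetical congestion-index readings every 15 minutes (all values invented).
times <- c("07:30", "07:45", "08:00", "08:15", "08:30", "08:45", "09:00")
tci   <- c(6.1, 8.2, 8.9, 9.3, 8.4, 8.1, 5.7)
threshold <- 8.0

# Run-length encode the above-threshold flag to find contiguous congested windows.
congested <- tci > threshold
runs   <- rle(congested)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

windows <- data.frame(
  from = times[starts[runs$values]],
  to   = times[ends[runs$values]]
)
print(windows)
```

With these made-up readings the single reported window runs from 07:45 to 08:45; applied to real index values, the same steps would list every contiguous period above the threshold.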

Continuing our analysis of traffic patterns for our region, we once more used the RGui software to plot the levels of traffic flow directly on a map of New Haven. We used the XML package (Cran Packages, 2014) to parse the GPX files (see Section 3.1) and the ggmap package (Cran Packages, 2014) to plot the traffic flow values on a Google Map layer. Thicker road segments carry higher traffic flow values. This is further evidence of the simplicity and relevance of the R language for achieving our objectives. We chose to display in Fig. 6b only part of New Haven, more specifically the Yale district, because of the limited space available.
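For readers unfamiliar with the GPX format, the parsing step can be illustrated as follows. This is a sketch in Python using only the standard library, not the R XML package used in the study. It assumes the files follow the GPX 1.1 schema, and the sample track, its coordinates and its timestamps are invented for the example. The map plotting step is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Namespace of the GPX 1.1 schema (assumed version of the input files)
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}


def read_track_points(gpx_text):
    """Extract latitude, longitude and timestamp from every track point."""
    root = ET.fromstring(gpx_text)
    points = []
    for trkpt in root.iterfind(".//gpx:trkpt", GPX_NS):
        time_el = trkpt.find("gpx:time", GPX_NS)
        points.append({
            "lat": float(trkpt.get("lat")),
            "lon": float(trkpt.get("lon")),
            "time": time_el.text if time_el is not None else None,
        })
    return points


# Hypothetical two-point track
sample = """<gpx version="1.1" creator="example" xmlns="http://www.topografix.com/GPX/1/1">
  <trk><trkseg>
    <trkpt lat="41.3083" lon="-72.9279"><time>2014-05-12T07:45:00Z</time></trkpt>
    <trkpt lat="41.3090" lon="-72.9285"><time>2014-05-12T07:45:10Z</time></trkpt>
  </trkseg></trk>
</gpx>"""

for point in read_track_points(sample):
    print(point)
```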

5. Conclusions and Future work

The goal of our analysis was to determine whether the GPS traffic data set collected using the MapMyRun online service is sufficient to draw inferences about temporal usage patterns of street segments in correlation with the traffic flow. Our results show that such an analysis is possible, but within limits. First, we were able to include about 65% of the street segments of the entire city of New Haven in our analysis when applying a threshold of 20 visits per segment. A threshold of 20 is a lower bound and allows for noise in the data. This is one reason for the two large non-divisive clusters (Cluster 2 and Cluster 4) in New Haven. However, given the roughly 5400 permanent traffic counting points available through the collected GPS data set, our results have a good ground truth for the proposed analysis. We were able to carry out a temporal analysis that explicitly examined inner-city traffic and the occurrence of congestion. Second, we obtained clusters with temporally distinct usage patterns. Visual inspection of these clusters showed that shopping and recreational activities in particular have a unique temporal usage pattern. However, the clusters in New Haven also showed that temporal patterns cannot clearly distinguish between consecutive activities. For example, the usage of access roads is closely connected to the usage of residential areas, where individual movements typically start or end. Third, the clustering showed that the most characteristic time span for distinguishing usage patterns is the weekend. We identified groups with a similar temporal traffic distribution by adjusting our clustering mechanism and interpreted the results based on temporal and spatial background knowledge. Thus we provided a novel and detailed analysis for the city of New Haven, for which we could clearly identify traffic patterns related to the flow on specific road segments.
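The segment inclusion rule mentioned above can be sketched as a simple count-and-filter step. The example is in Python rather than the R environment used in the study. It assumes that "a threshold of 20 visits" means at least 20 visits, and the segment ids and visit counts are hypothetical.

```python
from collections import Counter


def segments_above_threshold(visits, min_visits=20):
    """Return the set of segment ids visited at least min_visits times.

    visits: iterable of segment ids, one entry per recorded visit.
    """
    counts = Counter(visits)
    return {segment for segment, n in counts.items() if n >= min_visits}


def coverage(kept, all_segments):
    """Share of the street network that remains in the analysis."""
    return len(kept) / len(all_segments)


# Hypothetical visit log: s1 has 35 visits, s2 has 20, s3 has 7, s4 has none
visits = ["s1"] * 35 + ["s2"] * 20 + ["s3"] * 7
all_segments = ["s1", "s2", "s3", "s4"]

kept = segments_above_threshold(visits)
print(sorted(kept), coverage(kept, all_segments))
```

In this toy network two of the four segments pass the filter, so the coverage is 0.5. Raising `min_visits` reduces noise at the cost of coverage.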
The current research shows that R has generally good data mining capabilities, as well as promising spatiotemporal data mining capabilities for the clustering process. R allowed us to connect to and retrieve data sets from a major traffic route service and a digital map provider. The R packages used in this study were well documented, easy to install and powerful enough for our purpose. Both for visualization and for cluster aggregation, R was able to handle large traffic data sets and extensive document processing queries without memory issues.

As future work we plan to extend our analysis by including other data sources, such as socio-demographic information about the drivers and environmental information about the street segments. Such an extension could either be integrated into the existing clustering process or be carried out subsequently, following a multi-stage approach. In addition, we would like to increase the granularity of our results by interpreting not only the clusters as a whole, but also their characteristic peaks. We have also started an in-depth analysis of alternative clustering algorithms and specific model structures. These will be used within each component of our methodology in order to optimize and improve the traffic pattern analysis. We plan to compare the results obtained using GPS data against other traffic data sources, such as loop detectors, in order to better validate our proposed solution.

References


Andrienko, G., Andrienko, N., 2008. Spatio-temporal aggregation for visual analysis of movements. In Proc. of IEEE Visual Analytics Science and Technology (VAST 2008), pp. 51-58, IEEE Computer Society Press.

Antoniou, C., Koutsopoulos, H.N., 2006a. Estimation of traffic dynamics models with machine learning methods. Transportation Research Record, Volume 1965, pp. 103-111, Washington, DC.

Azimi, M., Zhang, Y., 2010. Categorizing freeway flow conditions using clustering methods. In: Proceedings of the 89th Annual Meeting of the Transportation Research Board, Washington, DC.

Cran Packages, 2014. [Online]. Available:

Dunne, S., Ghosh, B., 2012. Regime-based short-term multivariate traffic condition forecasting algorithm. JTE, 138 (4), pp. 455-466.

El Faouzi, N. E., Klein, L. A., Mouzon, O. D., 2009. Improving travel time estimates from inductive loop and toll collection data with Dempster-Shafer data fusion. Transportation Research Record 2129, pp. 73-80.

El Faouzi, N. E., 2004. Data-driven aggregative schemes for multisource estimation fusion: a road travel time application. Proceedings of SPIE, Volume 5434. SPIE, Bellingham, WA, pp. 351-359.

El Faouzi, N. E., Lefevre, E., 2006. Classifiers and distance-based evidential fusion for road travel time estimation. In: Dasarathy, Belur V. (Ed.), Multisensor, Multisource Information Fusion: Architectures, Algorithms, and Applications 2006. Proceedings of SPIE, Volume 6242.

Ester, M., Kriegel, H. P., Sander, J., Xu, X., 1996. A density based algorithm for discovering clusters in large spatial databases with noise. In Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96). pp. 226-231, AAAI Press.

Fitschen, A., Nordmann, H., 2010. Verkehrsentwicklung auf Bundesfernstraßen 2008 (Traffic development on federal roads 2008). Berichte der Bundesanstalt für Straßenwesen (V 191), NW-Verlag, Bremerhaven.

Gonzalez, M. C., Hidalgo, C. A., Barabasi, A. L., 2008. Understanding individual human mobility patterns. Nature 453(7169), pp. 779-782.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I. H., 2009. The WEKA Data Mining Software: An Update. SIGKDD Explorations, Volume 11, Issue 1.

Karlaftis, M. G., Vlahogianni, E. I., 2011. Statistical methods versus neural networks in transportation research: differences, similarities and some insights. Transportation Research Part C 19, pp. 387-399.

Lee, J.-G., Han, J., Whang, K.-Y., 2007. Trajectory clustering: a partition-and-group framework. In Proc. of the 2007 ACM SIGMOD International Conference on Management of Data (SIGMOD'07).

Liu, H., Van Zuylen, H., Van Lint, H., Salomons, M., 2006. Predicting urban arterial travel time with state-space neural networks and Kalman filters. Transportation Research Record, pp. 99-108.

MapMyRun, 2014. [Online]. Available:

Mitchell, T., 1997. Machine Learning. McGraw Hill.

Necula, E., 2014. Dynamic traffic flow prediction based on GPS Data, IEEE ICTAI, Limassol, pp. 922-929.

Pelekis, N., Andrienko, G., Andrienko, N., Kopanakis, I., Marketos, G., Theodoridis, Y., 2011. Visually exploring movement data via similarity-based analysis. JIIS Online First, pp. 1-49.

Piciarelli, C., Micheloni, C., Foresti, G. L., 2008. Trajectory based anomalous event detection, IEEE TCSVT, 18(11), pp. 1544-1554.

R Core Team, 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2013, ISBN 3-900051-07-0. [Online]. Available:

Rinzivillo, S., Pedreschi, D., Nanni, M., Giannotti, F., Andrienko, N., Andrienko, G., 2008. Visually driven analysis of movement data by progressive clustering. Information Visualization, 7(3), pp. 225-239.

Schlich, R., Axhausen, K. W., 2003. Habitual travel behaviour: Evidence from a six-week travel diary. Transportation 30, pp. 13-36.

Schrank, D., Lomax, T., 2005. The 2005 Annual Urban Mobility Report. Texas: Texas Transportation Institute.

Shaw, S. L., Yu, H., 2009. A GIS-based Time-geographic Approach of Studying Individual Activities and Interactions in a Hybrid Physical-virtual Space. Journal of Transport Geography, Volume 17, pp. 141-149.

Shen, Z., Ma, K. L., 2008. MobiVis: A Visualization System for Exploring Mobile Data. IEEE Pacific Visualisation Symposium, pp. 175-182.

Shoval, N., Isaacson, M., 2007. Sequence Alignment as a Method for Human Activity Analysis in Space and Time. Annals of the Association of American Geographers, Volume 97, no.2, pp. 282-297.

Shoval, N., 2008. Tracking technologies and urban analysis. Cities, vol.25, pp. 21-28.

Song, C., Qu, Z., Blumm, N., Barabasi, A. L., 2010. Limits of Predictability in Human Mobility. Science, Volume 327, no. 5968, pp. 1018-1021.

Stathopoulos, A., Karlaftis, M. G., 2003. A multivariate state space approach for urban traffic flow modeling and prediction. Transportation Research Part C: Emerging Technologies 11 (2), pp. 121-135.

Sun, L., Zhou, J., 2005. Developing multi-regime speed-density relationships using cluster analysis. Transportation Research Record: Journal of the Transportation Research Board 1934, pp. 64-71 (DC).

Underarmour, 2014. [Online]. Available:

van Lint, J.W.C., 2008. Online learning solutions for freeway travel time prediction. IEEE Transactions on ITS 9 (1), pp. 38-47.

Vlahogianni, E.I., Karlaftis, M.G., Golias, J.C., 2008. Temporal evolution of short-term urban traffic flow: a nonlinear dynamics approach. Computer-Aided Civil and Infrastructure Engineering 23, pp. 536-548.

Wang, Y., Papageorgiou, M., Messmer, A., 2006a. A real-time freeway network traffic surveillance tool. IEEE TCST 14 (2006), pp. 18-32.

Wang, Y., Papageorgiou, M., Messmer, A., 2006b. RENAISSANCE - a unified macroscopic model-based approach to real-time freeway network traffic surveillance. Transportation Research Part C: Emerging Technologies 14, pp. 190-212.

Weijermars, W., 2007. Analysis of urban traffic patterns using clustering. PhD Thesis, TRAIL Thesis Series T2007/3.