Abraham Duplaa
Jun 4, 2019

Analyzing Historical Traffic Data in Real-Time with OmniSci


Whether you live in San Francisco, Munich or Beijing, traffic is always a headache. Now that most jobs are located in these urban areas, daily commuters are pushing highway systems to their limits. This causes problems for residents and even bigger ones for city planners. Luckily, as tech-savvy commuters, we’ve become experts in using tools like Google Maps and Waze to make driving a bit more bearable. I can’t remember the last time I drove without using some navigation tool to check the optimal route. That said, the definition of the ‘optimal route’ has also changed with more technology and historical traffic data. Just finding the shortest route doesn’t cut it anymore; we’re now interested in which route will save us the most time. Nowadays, we can pull up current traffic flow maps of the city and pick the route with the least congestion and headache.

Aside from daily commuters, cities and municipalities are collecting massive amounts of traffic flow data each day and using it to make some pretty important data-driven decisions. With live monitoring of traffic, authorities can identify potential bottlenecks and dangerous intersections, and decide where to build more lanes.

Since OmniSci’s headquarters is in San Francisco, we wanted to see just how vehicular traffic flow has been changing in our region. Even though sitting in traffic is one of my least favorite things (I can only listen to so many podcasts), seeing how Bay Area traffic patterns have changed is pretty exciting. By using OmniSci’s traffic flow analysis tools, we can visualize and analyze a billion rows of 5-minute traffic data from San Francisco in real time.

Obtaining Historical Traffic Data

Caltrans Traffic Data

The data was obtained from the California Department of Transportation’s (Caltrans) Performance Measurement System (PeMS) Data Clearinghouse and is publicly available. Caltrans PeMS provides an abundance of historical traffic data for all of California, dividing the state into 12 districts; San Francisco is in district 4.

Multiple telemetry data series are available, ranging from speed to traffic incident information. To test out OmniSci Immerse’s capability of automatically resampling data for clearer visualizations, we chose to work with the 5-minute samples. The data is recorded by stations located throughout the freeways, which collect a variety of data including speed (in miles/hr) and occupancy (percent that the lane is full).

To understand how commuter traffic has changed in recent years, we decided to analyze traffic from 2015 to 2019. Once the historical traffic data was downloaded, there were still some data quality issues to take care of, including:

  • Missing values were handled. The method used for cleaning depended on the data element.
  • Unnecessary columns were removed. Since we weren’t interested in each lane’s speed, but rather the average speed at the station, the lane-specific columns were dropped.
  • Additional useful columns were added (e.g. hour of day, day of year).
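
As a rough illustration of those three steps, here is a minimal sketch for a single record. It is not the code from our GitHub repository: the column names, the timestamp format, and the policy of dropping rows without a station speed are all assumptions made for the example, and it uses only the Python standard library so it runs without any dependencies.

```python
from datetime import datetime

# Hypothetical column names; the real PeMS files use their own layout.
KEEP = ("timestamp", "station", "freeway", "direction", "avg_occupancy", "avg_speed")


def clean_row(raw):
    """Return a cleaned copy of one 5-minute station record, or None to drop it."""
    # Assumed policy: drop rows with no station-level speed instead of guessing one.
    if raw.get("avg_speed") in (None, ""):
        return None

    # Keep only station-level columns; lane-specific columns are discarded here.
    row = {key: raw[key] for key in KEEP}

    # Assumed timestamp format: MM/DD/YYYY HH:MM:SS.
    ts = datetime.strptime(row["timestamp"], "%m/%d/%Y %H:%M:%S")
    row["timestamp"] = ts
    row["avg_speed"] = float(row["avg_speed"])

    # A missing occupancy stays unknown (None) rather than becoming zero.
    occ = row["avg_occupancy"]
    row["avg_occupancy"] = float(occ) if occ not in (None, "") else None

    # Extra columns that make time-based filtering cheaper later on.
    row["hour_of_day"] = ts.hour
    row["day_of_year"] = ts.timetuple().tm_yday
    return row


# Made-up sample record for illustration only.
sample = {
    "timestamp": "07/04/2018 08:05:00", "station": "400001", "freeway": "101",
    "direction": "N", "avg_occupancy": "0.052", "avg_speed": "64.3",
    "lane_1_speed": "66.0", "lane_2_speed": "62.1",
}
cleaned = clean_row(sample)
print(cleaned["hour_of_day"], cleaned["day_of_year"])
```

At a billion rows you would do this in bulk rather than row by row, but the per-record logic stays the same.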

After cleaning, the dataset is just over 1 billion rows of historical traffic data, which I then uploaded to OmniSci using pymapd.
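
For a dataset this size, the upload is easier to manage in batches. The sketch below is one way that could look, not a record of how I actually did it: the batch size, the table name, and the choice of pymapd’s row-wise loader `load_table_rowwise` (which, as far as I know, expects the table to exist already) are assumptions. The helper only needs a connection object that exposes that method.

```python
from itertools import islice


def batches(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def upload(con, table, rows, size=1_000_000):
    """Send rows (tuples) to an existing table in batches; return the row count."""
    sent = 0
    for chunk in batches(rows, size):
        # Assumption: `con` is a pymapd connection and `table` already exists.
        con.load_table_rowwise(table, chunk)
        sent += len(chunk)
    return sent


# Hypothetical usage (credentials, host and table name are placeholders):
# from pymapd import connect
# con = connect(user="admin", password="...", host="localhost", dbname="omnisci")
# upload(con, "traffic", cleaned_rows)
```

Keeping the batching separate from the loader call means you can swap in a columnar load of DataFrames without touching the rest.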

Analyzing Bay Area Traffic Flow

Now the fun starts! With the traffic flow dataset loaded into the OmniSci database, we can use Immerse to really dive in:

Data Quality

Let’s start with a quick sanity check of the data. More cars on the road should correlate with slower traffic, and we can confirm that graphically by plotting Occupancy vs Speed (bottom right chart of the dashboard above).
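
The same check can be done numerically with a correlation coefficient. This is a minimal sketch, assuming you have pulled paired occupancy and speed values out of the table; the numbers below are made up for illustration and are not PeMS results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up sample values for illustration only.
occupancy = [0.02, 0.05, 0.10, 0.18, 0.25]
speed = [68.0, 66.0, 61.0, 45.0, 30.0]

r = pearson(occupancy, speed)
print(f"correlation between occupancy and speed: {r:.3f}")
```

A strongly negative value is what you would expect if fuller lanes mean slower traffic; a value near zero or positive would be a hint that something went wrong during cleaning.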

Cyclic nature

Now let’s move on to analyzing the entire four-year dataset. It’s interesting to note how cyclic traffic is each year. It’s pretty intuitive that traffic is cyclic every day (i.e. slower during rush hour, faster at night), but driving patterns are also cyclic throughout the year. We can see that occupancy of the freeways peaks in the summer of each year, which also corresponds with a dip in speed in June. Interestingly, while June usually has the highest occupancy, July always has a lower occupancy than both June and August:

When zooming in, we can see that having the 4th of July fall in the middle of the week means fewer cars and faster drivers, and it significantly lowers occupancy for the month. Travel tends to occur the weekend before the holiday:
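
Immerse does this resampling for you, but the underlying idea is just a group-by on year and month. Here is a minimal sketch, assuming records arrive as (date, occupancy) pairs; the values are invented to mimic the July dip and are not taken from the dataset.

```python
from collections import defaultdict
from datetime import date


def monthly_mean(records):
    """records: iterable of (date, occupancy). Returns {(year, month): mean}."""
    totals = defaultdict(lambda: [0.0, 0])  # (year, month) -> [sum, count]
    for day, occ in records:
        bucket = totals[(day.year, day.month)]
        bucket[0] += occ
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up values for illustration only.
records = [
    (date(2018, 6, 12), 0.090), (date(2018, 6, 20), 0.094),
    (date(2018, 7, 4), 0.060), (date(2018, 7, 18), 0.088),
    (date(2018, 8, 8), 0.089), (date(2018, 8, 22), 0.091),
]

means = monthly_mean(records)
for (year, month), value in sorted(means.items()):
    print(year, month, round(value, 3))
```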

Occupancy Issues and Bottlenecks

Another big concern for cities is where to expand roads to relieve congestion on the highways. One way to visualize which freeways are having occupancy issues is to use OmniSci’s heat map visualization. By plotting the freeways vs time, we can see which freeways have the highest occupancy over time:

It looks like freeways 237, 238, and 880 are consistently above the Bay Area average for all four years and seem to be in need of a capacity increase. However, what’s more interesting to note are the freeways that start out with low occupancy (green) and slowly creep towards higher occupancy. This Bay Area traffic data visualization shows that many freeways are increasing in occupancy: the left-hand side is much greener than the right. In this way, municipalities can identify freeways at risk of higher occupancy and proactively execute countermeasures.
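
The “creeping” freeways can also be ranked numerically by fitting a straight line through each freeway’s yearly mean occupancy and sorting by slope. This is only a sketch of that idea; the freeway labels and values are hypothetical and a least-squares slope is my choice of trend measure, not something Immerse computes for the heat map.

```python
def slope(points):
    """Least-squares slope of y over x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den


# Hypothetical freeways with made-up yearly mean occupancy.
yearly = {
    "freeway_a": [(2015, 0.110), (2016, 0.112), (2017, 0.111), (2018, 0.113)],
    "freeway_b": [(2015, 0.050), (2016, 0.058), (2017, 0.066), (2018, 0.074)],
}

trend = {fwy: slope(points) for fwy, points in yearly.items()}
rising = sorted(trend, key=trend.get, reverse=True)
print(rising)
```

In this made-up example, the freeway with the higher occupancy today is not the one with the steeper trend, which is exactly the case a planner would want to catch early.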

We can also visualize traffic spatially by using OmniSci’s hexmap visualization. With this view, it’s apparent that San Francisco is a bottleneck for traffic in district 04 based on speed:

Furthermore, traffic slows down as the freeways get closer to the cities, especially San Francisco and San José.

When taking a closer look and switching the Bay Area traffic map theme to ‘streets’, we can easily identify which freeways have the worst traffic. The traffic east of the bay is significantly worse than to the west. Also, there seem to be bottlenecks in unexpected regions. One potential region to look into is Pleasanton, a Bay Area suburb. Highway 680 seems to historically encounter a bottleneck in this region.

Commuting to San Francisco

Aside from finding historic bottlenecks, we can also analyze commuter behavior to San Francisco:

For a city regarded as one of the worst in the US for commuters, an average of around 55 miles/hr doesn’t seem terrible. However, the standard deviation reveals that speed varies significantly during rush hour. Work schedules seem pretty similar throughout the work week. Most people are on the road in the mornings between 5AM-10AM and heading home between 3PM-7PM. It’s easy to see which days are Saturday and Sunday.
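
To see why the average alone is misleading, it helps to compute the mean and the standard deviation per hour of day. A minimal sketch, assuming (hour, speed) pairs and using made-up speeds:

```python
from collections import defaultdict
from statistics import mean, stdev


def hourly_profile(samples):
    """samples: iterable of (hour, speed). Returns {hour: (mean, std dev)}."""
    by_hour = defaultdict(list)
    for hour, speed in samples:
        by_hour[hour].append(speed)
    # The sample standard deviation needs at least two values per hour.
    return {
        hour: (mean(speeds), stdev(speeds))
        for hour, speeds in sorted(by_hour.items())
        if len(speeds) > 1
    }


# Made-up speeds in miles/hr for illustration only.
samples = [
    (3, 66.0), (3, 67.0), (3, 65.0),   # night: fast and steady
    (8, 62.0), (8, 35.0), (8, 50.0),   # morning rush: slower and erratic
]

profile = hourly_profile(samples)
for hour, (avg, spread) in profile.items():
    print(f"{hour:02d}:00  mean={avg:.1f}  std={spread:.1f}")
```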

Now let’s say you’ve lived in San José for the last 4 years and that you drive to San Francisco during morning rush hour (a very unfortunate commute). We can use OmniSci’s in-chart filters to easily visualize which freeway has historically been the quickest during these hours. First, we select our parameters in the appropriate visualizations:

We’ve selected the hours between 7AM-10AM, only business days, only northbound traffic, and, through the map, the area between San José and San Francisco. Almost instantly, we obtain the speed of each northbound freeway between San José and San Francisco.
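
Under the hood, those chart selections amount to a set of filters followed by a group-by on freeway. Here is a rough sketch of the equivalent logic; the field names, the bounding box coordinates, and the sample records are all hypothetical, and “business days” is simplified to Monday through Friday without accounting for holidays.

```python
from collections import defaultdict
from datetime import datetime

# Rough, hypothetical bounding box between San José and San Francisco.
LAT_MIN, LAT_MAX = 37.30, 37.80
LON_MIN, LON_MAX = -122.52, -121.85


def is_morning_commute(rec):
    """True for northbound weekday records from 7AM up to 10AM inside the box."""
    ts = rec["timestamp"]
    return (
        7 <= ts.hour < 10
        and ts.weekday() < 5            # Monday=0 .. Friday=4, holidays ignored
        and rec["direction"] == "N"
        and LAT_MIN <= rec["lat"] <= LAT_MAX
        and LON_MIN <= rec["lon"] <= LON_MAX
    )


def mean_speed_by_freeway(records):
    """Average speed per freeway over the records that pass the filter."""
    speeds = defaultdict(list)
    for rec in records:
        if is_morning_commute(rec):
            speeds[rec["freeway"]].append(rec["avg_speed"])
    return {fwy: sum(vals) / len(vals) for fwy, vals in speeds.items()}


# Made-up records on two hypothetical freeways.
base = {"lat": 37.5, "lon": -122.2}
records = [
    {**base, "timestamp": datetime(2018, 6, 5, 8, 0), "freeway": "fwy_a", "direction": "N", "avg_speed": 40.0},
    {**base, "timestamp": datetime(2018, 6, 5, 9, 0), "freeway": "fwy_a", "direction": "N", "avg_speed": 50.0},
    {**base, "timestamp": datetime(2018, 6, 5, 8, 0), "freeway": "fwy_b", "direction": "N", "avg_speed": 62.0},
    {**base, "timestamp": datetime(2018, 6, 5, 8, 0), "freeway": "fwy_b", "direction": "S", "avg_speed": 66.0},
    {**base, "timestamp": datetime(2018, 6, 9, 8, 0), "freeway": "fwy_a", "direction": "N", "avg_speed": 65.0},
]

result = mean_speed_by_freeway(records)
print(result)
```

The southbound record and the Saturday record are filtered out, so only the weekday northbound samples contribute to each freeway’s average.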

Finally, let’s take a look at last year’s commuting patterns of OmniSci employees (and other commuters) in downtown San Francisco. I’ve filtered to a radius around OmniSci headquarters so that every freeway used to leave OmniSci is captured. In the time graph, traffic looks pretty constant throughout last year. Freeways 101 and 80 definitely experience the worst rush hour effects, as shown on the traffic flow map. Something to note when next visiting OmniSci: don’t take freeway 80 in the afternoon. It has a total traffic flow rate of almost 420 cars, as seen in the bottom right by its bubble size, and also experiences an almost 30 mph decrease in speed from free flow conditions!

By utilizing OmniSci’s powerful GPU database and the flexibility of OmniSci Immerse, we’ve uncovered a few interesting points:

  • Anomalies in normal traffic cyclic behavior, especially in July
  • Identification of freeways with high occupancy or increasing occupancy
  • Potential bottlenecks in the region, specifically when driving through Pleasanton
  • Commuting patterns in the Bay Area

Now that we’ve analyzed how traffic in the Bay Area behaves spatially and over time, the next step is to use time series modeling for traffic flow prediction. My next blog post will cover how to predict traffic flow using OmniSci and TensorFlow.

In the meantime, we’d love to see how you would generate traffic flow calculations using OmniSci Core or OmniSci Cloud! The code used to download and clean the data is public on GitHub, and we’ve exported the pre-cleaned data from our OmniSci database into S3 for you to download (warning: it’s 31GB compressed). If you come up with a great dashboard that finds new insights, analyzes traffic for your area, or uses our Python client pymapd to enable more powerful analysis of your data, please share your results with us. We’ve only scratched the surface of what’s possible with this historical traffic flow dataset, and we welcome any and all feedback on this post.

Abraham Duplaa

Abraham Duplaa is a Developer Advocate for OmniSci. He is also an aspiring data scientist and has previous industry experience in oil and gas and the automotive sector. He is currently pursuing an M.Sc. in Computational Science and an honors degree in Technology Management from the Technical University of Munich.