Abraham Duplaa
Jan 15, 2020

Modeling Traffic Behavior as a Function of Real-Time Traffic Flow and Weather


In the first two posts in this series, I demonstrated how OmniSci can interactively visualize billion-row datasets and how enriching an analysis with weather data helps untangle a complex relationship like Bay Area traffic. In this final post, I will demonstrate a few data science approaches, using tools from the PyData ecosystem, to model the relationship between incidents and weather.

Classifying the Severity of an Accident Using Random Forest

Cities and counties have a finite number of tow trucks, ambulances, and emergency responders. A real-time prediction of an incident’s severity, based on readily available information such as current weather and traffic data, would help cities cut down on traffic jams and allocate emergency responders to the incidents that most need attention. To evaluate the feasibility of such a model, I’ll use random forest classification, a decision tree-based algorithm, to classify each accident as 0 (hazard/not severe) or 1 (severe).

Data Preparation

As a quick reminder, we have traffic data (speed, occupancy) and weather data to use as our independent variables, and we have traffic incident data from which to derive our accident severity classification. Each of these can be loaded into its own pandas dataframe for preparation. But how can we combine all the features into one massive table? There are no match keys in this data. Instead, the traffic data contains a station ID with latitude and longitude, the weather data contains a weather station ID with latitude and longitude, and the incident data contains a location description with latitude and longitude.

With OmniSci Visual Data Fusion (VDF), combining tables within a plot is easy. However, joining tables so that they can be used for data science can often be awkward using only SQL. For this project, two functions were written in Python to loop through the traffic and weather data and join each to the nearest incident location. With over 2,400 traffic stations, over 4,200 incident locations, and 19 weather stations, there are plenty of distances to compute from latitude and longitude. Since this would have taken ages on a single core, a simple parallelized lat-long distance calculation function was written to map the correct traffic and weather station to each incident location.
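A minimal sketch of this kind of nearest-station lookup, not the original functions. It assumes the coordinates are held in NumPy arrays of (latitude, longitude) in degrees, uses the haversine formula for distance, and splits the incidents into chunks that a thread pool maps over. The function names and the chunk count are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def nearest_station(points, stations):
    """For each (lat, lon) row in points, return the row index of the closest station."""
    # Broadcasting yields a (n_points, n_stations) distance matrix.
    dist = haversine_km(points[:, 0:1], points[:, 1:2],
                        stations[:, 0], stations[:, 1])
    return dist.argmin(axis=1)


def nearest_station_parallel(points, stations, n_chunks=4):
    """Split the points into chunks and look up each chunk concurrently."""
    chunks = np.array_split(points, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = pool.map(lambda chunk: nearest_station(chunk, stations), chunks)
    return np.concatenate(list(parts))


# Hypothetical coordinates: two incident locations and two stations.
incidents = np.array([[37.77, -122.42], [37.34, -121.89]])
stations = np.array([[37.80, -122.27], [37.35, -121.95]])
station_idx = nearest_station_parallel(incidents, stations, n_chunks=2)
```

The returned indices can serve as the key that links each incident location to its traffic or weather station. Threads are used here only to keep the sketch short; a process pool is the more common choice when the per-chunk work is pure Python rather than NumPy.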

Once the key has been made with the correct traffic and weather stations, we can then perform an inner join with the massive traffic data and weather data, and finally with the incident data.

The only problem left to solve in data preparation is that traffic and weather are recorded at regular intervals, but incidents happen at any minute and hour of the day. That’s where pandas pd.merge_asof() comes in handy.
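A minimal sketch of such a merge on toy data. The column names (timestamp, station, speed, severity), the values, and the five-minute tolerance are assumptions for illustration only.

```python
import pandas as pd

# Toy incident records with irregular timestamps (hypothetical columns).
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-01-01 08:07", "2019-01-01 08:21"]),
    "station": [101, 102],
    "severity": [0, 1],
})

# Toy traffic/weather readings recorded at regular intervals.
df_traffic_weather = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-01-01 08:00", "2019-01-01 08:05", "2019-01-01 08:10",
        "2019-01-01 08:20", "2019-01-01 08:25",
    ]),
    "station": [101, 101, 101, 102, 102],
    "speed": [64.0, 58.0, 41.0, 35.0, 30.0],
})

# merge_asof requires both frames to be sorted by the "on" key.
merged = pd.merge_asof(
    df.sort_values("timestamp"),
    df_traffic_weather.sort_values("timestamp"),
    on="timestamp",                  # match on the closest timestamp...
    by="station",                    # ...within the same station
    direction="nearest",
    tolerance=pd.Timedelta("5min"),  # leave unmatched if nothing is within 5 minutes
)
```

The default direction is "backward", which only matches readings at or before each incident; "nearest" is used here to mirror the nearest-timestamp matching described in this post.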

Here we can merge the incident data (df) with the traffic and weather dataframe (df_traffic_weather) on the nearest timestamp and by station. With the data together in a single dataframe, we can now move on to feature engineering.

Feature Engineering

Numerical Features

It wouldn’t make sense to use the speed and road occupancy at the moment of the recorded incident, since the authorities already know what has occurred. But it would make sense to use the information recorded shortly before the incident. For the 30 minutes prior to the incident, we can calculate some summary statistics as features for our model. By using the pandas rolling() function, we can grab the last 30 minutes of data and calculate the moving minimum, maximum, standard deviation, and mean for the speed and occupancy columns.
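As a rough sketch of these rolling statistics, assuming a dataframe with hypothetical columns timestamp, station, speed, and occupancy: the window is time-based and closed on the left, so each row only sees the readings from the 30 minutes before it and not its own value.

```python
import pandas as pd

STATS = ("min", "max", "std", "mean")


def rolling_features(traffic, window="30min"):
    """Per-station statistics over the 30 minutes preceding each reading."""
    # A time-based rolling window needs a sorted DatetimeIndex.
    indexed = traffic.sort_values("timestamp").set_index("timestamp")
    rolled = (indexed.groupby("station")[["speed", "occupancy"]]
              .rolling(window, closed="left"))  # exclude the current reading
    # One block of columns per statistic, e.g. speed_mean_30min.
    stats = pd.concat(
        [getattr(rolled, stat)().add_suffix(f"_{stat}_30min") for stat in STATS],
        axis=1,
    )
    return stats.reset_index()


# Toy readings every ten minutes for a single hypothetical station.
traffic = pd.DataFrame({
    "timestamp": pd.date_range("2019-01-01 08:00", periods=5, freq="10min"),
    "station": [101] * 5,
    "speed": [60.0, 50.0, 40.0, 30.0, 20.0],
    "occupancy": [0.1, 0.2, 0.3, 0.4, 0.5],
})
features = rolling_features(traffic)
```

The first reading of each station has no history, so its statistics are NaN and need to be dropped or imputed before training.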

We can also add features comparing the current speed and occupancy to normal conditions. Essentially, these features describe how far the current speed and occupancy are from what is normal. We can calculate this difference at 20, 10, and 5 minutes before the incident was recorded. Again, pandas provides the shift() and groupby() functions to make these calculations straightforward.
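One way this could look, as a sketch under stated assumptions: "normal" is taken to be the mean for the same station, day of week, and time of day; readings arrive every five minutes without gaps; and the column names are hypothetical.

```python
import pandas as pd


def deviation_features(traffic, lags_min=(5, 10, 20), step_min=5):
    """Deviation from the typical value, plus lagged copies of that deviation."""
    out = traffic.sort_values(["station", "timestamp"]).reset_index(drop=True)
    out["dow"] = out["timestamp"].dt.dayofweek
    out["minute_of_day"] = out["timestamp"].dt.hour * 60 + out["timestamp"].dt.minute
    for col in ("speed", "occupancy"):
        # Baseline: mean for this station at this weekday and time of day.
        baseline = out.groupby(["station", "dow", "minute_of_day"])[col].transform("mean")
        out[f"{col}_dev"] = out[col] - baseline
        for lag in lags_min:
            # shift() moves by rows, so this relies on a regular step_min spacing.
            out[f"{col}_dev_{lag}min"] = (
                out.groupby("station")[f"{col}_dev"].shift(lag // step_min)
            )
    return out


# Two Mondays of toy readings for one hypothetical station.
traffic = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-01-07 08:00", "2019-01-07 08:05",
        "2019-01-14 08:00", "2019-01-14 08:05",
    ]),
    "station": [101, 101, 101, 101],
    "speed": [60.0, 50.0, 40.0, 50.0],
    "occupancy": [0.1, 0.1, 0.3, 0.1],
})
features = deviation_features(traffic)
```

Because the shift is row-based, gaps in the readings would misalign the lags; reindexing each station onto a regular time grid first would avoid that.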

Categorical Features

For the features with low cardinality, we can use one-hot encoding. However, since there are over 4,200 unique incident locations, this feature is better encoded using hashing.

Although hashing makes it difficult to understand which locations are the most important for determining severity, it saves us from creating 4,200 columns while still encoding the important information about the locations.
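To illustrate the idea, here is a small standard-library sketch of the hashing trick. It is an assumption about the general technique, not the encoder used for this project; the bucket count of 64 and the example location strings are made up.

```python
import hashlib


def hash_bucket(value, n_buckets=64):
    """Map a string to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_encode(values, n_buckets=64):
    """Encode each string as a fixed-width indicator row with a single 1."""
    rows = []
    for value in values:
        row = [0] * n_buckets
        row[hash_bucket(value, n_buckets)] = 1
        rows.append(row)
    return rows


# Hypothetical location descriptions.
locations = ["I-880 N at 23rd Ave", "US-101 S at Cesar Chavez", "I-880 N at 23rd Ave"]
encoded = hash_encode(locations)
```

Every location maps to one of 64 columns no matter how many distinct locations exist. The cost is that unrelated locations can collide in the same bucket, which is also why the encoded columns are hard to interpret. scikit-learn's FeatureHasher offers a ready-made version of the same idea.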

Finally, I added columns which describe whether the incident was near an off/on ramp, at an intersection, or at a stretch of highway.



With the data preparation and feature engineering complete, we’re able to train our classification model and make predictions. For detailed information on the training of the model, please refer to the Jupyter Notebook.
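As a rough illustration of the training step (this is not the notebook's code), the sketch below fits a scikit-learn random forest on synthetic data. The feature names, the synthetic labeling rule, and the hyperparameters are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features (NOT the real data).
feature_names = ["speed_dev_5min", "occupancy_dev_5min", "rainfall", "hour"]
rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(feature_names)))
# Made-up rule: an incident is "severe" (1) when speed is well below normal.
y = (X[:, 0] + 0.1 * rng.normal(size=500) < 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
# One importance score per feature; the scores sum to 1.
importances = dict(zip(feature_names, model.feature_importances_))
```

The accuracy on this synthetic data says nothing about the real model. The point is only the shape of the workflow: split, fit, score on held-out data, then read the feature importances.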

After training, our model can predict the severity with around 65-70% accuracy.

Analyzing Model Performance Using OmniSci Immerse

There are several things we can still do to increase the accuracy of our model (some of which I’ll go through in the conclusion), but we can also use OmniSci Immerse to investigate why the model didn’t perform as well as we had hoped. For example, we can investigate which features are most important and whether any subsets of the data were classified poorly. To do this, we can easily load the data back into OmniSci using pymapd.
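A minimal sketch of that upload, assuming a local OmniSci server. The credentials shown are placeholders for a default local install, and the table name and helper function names are hypothetical.

```python
def connect_omnisci():
    """Open a pymapd connection (placeholder credentials for a local server)."""
    from pymapd import connect  # imported lazily so the helper below has no hard dependency

    return connect(
        user="admin",
        password="HyperInteractive",
        host="localhost",
        dbname="omnisci",
    )


def upload_predictions(con, results_df, table_name="incident_predictions"):
    """Write a pandas dataframe of model results to an OmniSci table."""
    # load_table creates the table if needed and picks a load method from the data type.
    con.load_table(table_name, results_df, preserve_index=False)
    return table_name
```

With a running server, `upload_predictions(connect_omnisci(), results_df)` would make the predictions available as a table for an Immerse dashboard.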

And we can quickly create a dashboard to analyze the model results.

The feature importance plot on the left shows that the speed and occupancy data from before the incident is reported are highly important in making a decision. Most notably, the six most important features are the differences in speed and occupancy relative to the average for that time of day and day of the week. The speed and occupancy features provide the most information, followed by a drop-off in importance, after which weather data and location data contribute:

Feature Importance chart in OmniSci Immerse

The analysis also shows that the model performs better on weekdays and during work hours than in the evenings and on weekends. Unsurprisingly, the least important features are the broad location categories such as area and freeway. They provide little information to the model on the severity of a traffic incident.

Possibilities for Improvement

Although the model performed better than a coin flip, there are plenty of things that can be tried to improve its accuracy. The most important change would be to include more training data. Training on more than two months of data could make the model much more robust at classifying incidents by providing wider variation in the temporal attributes and weather. Also, as we saw on the OmniSci dashboard, the engineered features were the most important. It would be interesting to try out a library such as featuretools for automated feature engineering.

Predicting Traffic Flow Using Deep Learning

Perhaps we’re interested in predicting overall traffic flow instead of accident severity. To make use of all the data I already have in my OmniSci database, pymapd can be used again, this time to read data from OmniSci into pandas. pymapd provides an easy-to-use API to get my data from OmniSci’s backend directly into a pandas DataFrame or a cuDF GPU dataframe. Using the function con.select_ipc("SQL Query"), the data I was just visualizing in OmniSci Immerse is back and ready to use in pandas in a Jupyter notebook.
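A short sketch of that read path. Here `con` is assumed to be an open pymapd connection, and the table and column names are hypothetical.

```python
def read_traffic(con, table="traffic", limit=1000):
    """Pull traffic rows from OmniSci into a pandas DataFrame."""
    # Hypothetical table and column names; adjust to the actual schema.
    query = (
        f"SELECT station_id, reading_time, speed, occupancy "
        f"FROM {table} LIMIT {int(limit)}"
    )
    # select_ipc hands the result over via shared memory, so the client
    # needs to run on the same machine as the OmniSci server.
    return con.select_ipc(query)
```

pymapd also has a GPU counterpart, select_ipc_gpu, which returns a cuDF dataframe instead of a pandas one.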

Using LSTM with Traffic Data

With the emergence of big data and real-time navigation systems, one of the next challenges being tackled in traffic behavior is short-term traffic flow prediction. By predicting how traffic will look over the next half hour, cities can make sure that emergency services facing a 30-minute trip take not just the route that is optimal right now according to real-time data, or the route that is usually optimal, but the route that is predicted to be optimal. Even though a certain route may be faster at the moment, the trend may show that it will no longer be the best option in as little as 5-10 minutes. Taking the route that is fastest over the course of the trip, rather than only at its start, could potentially save lives.

To predict short-term traffic flow, I built a neural network which utilizes long short-term memory (LSTM), a recurrent neural network architecture, to learn traffic flow and predict traffic conditions in the next 30 minutes. To do so, I trained the network with an hour of previous traffic data. To keep the model simple, I used only speed and occupancy as the features.
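An LSTM expects its input as overlapping windows of shape (samples, timesteps, features). The sketch below shows one way to build such windows with NumPy. It assumes readings every five minutes, so 12 steps cover the hour of history and 6 steps cover the 30-minute horizon; the function name and parameters are illustrative, and the network itself is not shown.

```python
import numpy as np


def make_windows(series, n_in=12, horizon=6, target_col=0):
    """Turn a (time, features) array into LSTM inputs and targets.

    X[i] holds n_in consecutive readings; y[i] is the target column
    `horizon` steps after the last reading in X[i].
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - n_in - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested window")
    X = np.stack([series[i:i + n_in] for i in range(n_samples)])
    first_target = n_in + horizon - 1
    y = series[first_target:first_target + n_samples, target_col]
    return X, y


# Toy data: 20 time steps of (speed, occupancy).
toy = np.arange(40, dtype=float).reshape(20, 2)
X, y = make_windows(toy)
```

With 20 readings this yields three samples: each X[i] is one hour of (speed, occupancy) pairs and each y[i] is the speed 30 minutes after that hour ends.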

Once the model was trained, I tested it on the month of February 2019. To visualize the predictions in OmniSci, it was again just a matter of using pymapd to load the data into a table. Using only an hour of prior data for these two features as input, we were able to predict speed with a root mean squared error (RMSE) of around 5. I then added hourly weather data to the model to see if it would improve the prediction. Doing so decreased the RMSE slightly to around 4, which is not much of a difference.
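For reference, RMSE is the square root of the mean squared difference between the predicted and actual values. A small NumPy sketch with made-up numbers:

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


# Made-up speeds: errors of 3 and 4 give sqrt((9 + 16) / 2).
example_rmse = rmse([60.0, 50.0], [57.0, 54.0])
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, and the result is in the same units as the predicted speed.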

However, when examining the data further and looking at how the models predict short term traffic flow, it’s apparent that the model which uses weather predicts closer to the actual speed when there is high rainfall:

Predicted vs. Actual Traffic Speed Using LSTM with weather

OmniSci: A Key Part of the Overall Data Science Workflow

These examples were created for illustration purposes only, so model accuracy can likely be improved with more effort and data. One such improvement might be to use standard time series methods to remove the non-stationarity caused by daily seasonality. Another easy improvement would be to incorporate spatial data. However, across these past three blog posts, I hope I have demonstrated how easy it is to move back and forth between OmniSci and Python tools like Jupyter Notebook and pandas. This flexibility allows data scientists to make impactful visualizations, build predictive models, and generally accelerate their data science workflows.

We’d love to see how you would analyze and predict traffic flow using OmniSci Core or OmniSci Cloud. If you come up with a great dashboard that finds new insights, a robust model that can accurately predict traffic, or an interesting analysis built with our Python client pymapd, please stop by our Community Forum and share your results. We’ve only scratched the surface of what’s possible with this dataset, so we welcome any and all feedback on this post.

Abraham Duplaa

Abraham Duplaa is a Developer Advocate for OmniSci. He is also an aspiring data scientist and has previous industry experience in oil and gas and the automotive sector. He is currently pursuing a M.Sc. in Computational Science and an honors degree in Technology Management from the Technical University of Munich.