Oct 13, 2016

Speeding Through NYC: The Billion+ Row NYC Taxi Dataset


To jump straight to the demo, click here.

New York City is special to us. It’s not where we started (Boston) or where we ended up (San Francisco). It is special because it remains, for America, the center of it all.

While not the geographical center of it all, it is the data center for us. When we do demos of the Tweetmap, we end up at the Empire State Building. When we do demos of Political Donations, we end up looking at the island of Manhattan (and at Donald Trump’s past inclinations to give to Dems). Even when we look at flights, we focus on the mighty triumvirate of Newark, JFK and LaGuardia.

The latest addition to our public demos is the absolutely spectacular 1.2 billion row taxi/limo/Uber/Lyft dataset from NYC. The dataset captures staggering detail (full GPS, transaction type, passenger counts, timestamps) from January 2009 through June 2015 (essentially the birth of rideshare).

Released by the New York City Taxi & Limousine Commission in response to a FOIA request, the dataset became a darling of the data science set while also emerging as a popular test of database query speed. In a previous post we detailed how database enthusiast Mark Litwintschik benchmarked us on this dataset and found us orders of magnitude faster than the CPU-based competition.

Since the dataset has been in the public domain for a bit, we thought it would be fun to spruce it up a little. We turned to our friends over at Factual to help us in that regard. Through them we added the location of every business in NYC.

The initial impetus to add in the Factual data was our demo at Finovate. Our goal was to show how to separate signal from noise and generate insights at scale.

Todd gives an overview of the dataset in this short video.

Without making this blog post exhaustive, here are some things to look for:

  • Find the trends around cash and credit over time. When you do, you will appreciate the impact of the ride sharing economy.
  • Find the “commuter confidential” tricks that get played around bridges. Hint: it is in the color of the ride (indicating its destination).
  • Find every Starbucks in NYC and see how many rides are being dropped off at the chain.
  • Find all of the Hyatts and look at how the stock price mirrors its fortunes in NYC (noting, of course, that this is a starting point for further investigation).
  • Discover what Uber/Lyft did for the Hamptons starting in 2014 (this will require some extra skills: click on the settings gear on the point map and increase the size of your dots).
  • Extra bonus if you can figure out who on Hedges Lane is using it as their primary mode of transportation from Manhattan (Uber estimates it at a minimum of $187 for an UberX, up to $630 for an SUV).
  • Figure out how many simultaneous events there were in NYC on September 19, 2010 to cause such a massive spike in traffic (hint, Google has the answer)
  • Figure out what event caused the huge drops in traffic (Google also has the answer)

The takeaway, even for the casual user, should be clear: you are effectively driving the equivalent of a supercomputer.

The entire experience is driven by eight NVIDIA K80 cards, meaning that each query runs on nearly 40,000 GPU cores.
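For the curious, the "nearly 40,000" figure can be checked with quick arithmetic. The per-card numbers below are an assumption taken from NVIDIA's published Tesla K80 specification (two GPUs per card, 2,496 CUDA cores each); they are not stated in this post.

```python
# Assumed Tesla K80 figures (from NVIDIA's public spec, not from this post):
# each card carries two GPUs with 2,496 CUDA cores apiece.
CARDS = 8
GPUS_PER_CARD = 2
CORES_PER_GPU = 2496


def total_cuda_cores(cards=CARDS, gpus_per_card=GPUS_PER_CARD,
                     cores_per_gpu=CORES_PER_GPU):
    """Return the total CUDA core count across all cards."""
    return cards * gpus_per_card * cores_per_gpu


print(total_cuda_cores())  # 39936
```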

We can render 1.2 billion points on Mapbox because we are not actually rendering 1.2 billion points in the browser.

This is actually important.

We cover it in detail in this excellent post, but to summarize, we use the GPU to render the image, compress it to a .png (about 100KB) and send it to the browser as a tile. This allows for lightning-fast rendering and the perception by the user that all of this data is actually in their browser.
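To make the idea concrete, here is a minimal CPU-only sketch of server-side tile rendering. It is not our GPU renderer: the function names (`rasterize`, `encode_png`, `render_tile`), the 256-pixel tile size, and the random test points are all assumptions made for illustration. What it shows is that the payload sent to the browser is bounded by the pixel grid, not by the number of points behind it.

```python
import random
import struct
import zlib


def rasterize(points, width, height):
    """Bin (x, y) points from the unit square into a grid of 0/255 bytes."""
    grid = [bytearray(width) for _ in range(height)]
    for x, y in points:
        col = min(int(x * width), width - 1)
        row = min(int(y * height), height - 1)
        grid[row][col] = 255
    return grid


def _chunk(tag, data):
    """Build one PNG chunk: length, tag, data, CRC over tag + data."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def encode_png(grid):
    """Encode a grid of byte rows as an 8-bit grayscale PNG."""
    height, width = len(grid), len(grid[0])
    # IHDR: width, height, bit depth 8, color type 0 (grayscale),
    # then compression, filter and interlace methods, all 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(row) for row in grid)
    return (b"\x89PNG\r\n\x1a\n"
            + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(raw, 9))
            + _chunk(b"IEND", b""))


def render_tile(points, size=256):
    """Render points to a size x size PNG tile and return its bytes."""
    return encode_png(rasterize(points, size, size))


rng = random.Random(0)
small = [(rng.random(), rng.random()) for _ in range(1_000)]
large = [(rng.random(), rng.random()) for _ in range(200_000)]
# The tile sizes stay in the same ballpark even though the inputs differ 200x.
print(len(render_tile(small)), len(render_tile(large)))
```

However many points go in, the browser only ever receives one small image per tile.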

The reason that the queries respond so quickly is also important.

Our use of LLVM is a critical part of the technology stack and is also covered in detail in a blog post. In short, however, LLVM allows us to compile each query into code that runs at the speed of a handwritten query while remaining portable.
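As a rough analogy only, the sketch below contrasts interpreting a filter for every row with generating a function specialized to that one filter. Python's built-in `compile` stands in for LLVM here, which it is not: LLVM emits native machine code, and nothing below reflects our actual code generator. The column names (`passenger_count`, `payment_type`) and helper names are hypothetical.

```python
import operator

# Supported comparison operators (an illustrative subset).
OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def interpret(filters, row):
    """Interpreted path: walk the filter list again for every row."""
    return all(OPS[op](row[col], value) for col, op, value in filters)


def compile_filters(filters):
    """Compiled path: emit one function specialized to this filter list."""
    for _, op, _ in filters:
        if op not in OPS:
            raise ValueError(f"unsupported operator: {op}")
    terms = [f"row[{col!r}] {op} {value!r}" for col, op, value in filters]
    body = " and ".join(terms) or "True"
    source = f"def predicate(row):\n    return {body}\n"
    namespace = {}
    exec(compile(source, "<query>", "exec"), namespace)
    return namespace["predicate"]


# Hypothetical rows and filter, for illustration only.
rows = [
    {"passenger_count": 1, "payment_type": "cash"},
    {"passenger_count": 3, "payment_type": "credit"},
    {"passenger_count": 4, "payment_type": "cash"},
]
filters = [("passenger_count", ">", 2), ("payment_type", "==", "cash")]

predicate = compile_filters(filters)
matches = [row for row in rows if predicate(row)]
print(matches)  # [{'passenger_count': 4, 'payment_type': 'cash'}]
```

The generated function has the column names and constants baked in, so the per-row work no longer includes looking up what the query asked for.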

While there are certainly other ways we squeeze every last cycle out of a GPU, those are better left for a conversation with us.

In the meantime, play around, share your findings on Twitter or LinkedIn and generally geek out with this incredible experience.


HEAVY.AI (formerly OmniSci) is the pioneer in GPU-accelerated analytics, redefining speed and scale in big data querying and visualization. The HEAVY.AI platform is used to find insights in data beyond the limits of mainstream analytics tools. Originating from research at MIT, HEAVY.AI is a technology breakthrough, harnessing the massive parallel computing of GPUs for data analytics.