Tai Dupree
Apr 14, 2017

Quick Insight with MapD Immerse Cross Filtering

The MapD Immerse visual analytics client has a core feature we refer to as crossfilter, which allows a filter applied to one chart to be applied simultaneously to the rest of the charts on a dashboard. This provides a natural interface for data exploration, allowing a multi-dimensional view of data even as a user drills deep into a dataset. From a technical perspective, crossfiltering is not difficult (on the surface). Behind each Immerse chart is a SQL statement. When an element on the chart is clicked, we apply that filter to the rest of the charts on the dashboard. This is easy to do in SQL: just add the condition to the WHERE clause.

Political Donations Dashboard

For example, say I’m viewing our Political Donations Dashboard and I click on the "Barack Obama (D) / (D)" bar on the bar chart. This chart has recipient_name and recipient_party as dimensions (GROUP BY columns). With this simple interaction I have just applied the SQL filter recipient_name = 'Barack Obama (D)' AND recipient_party = 'D' to all other charts on the dashboard. Under the hood, all we have done is edit the SQL behind each chart to add these conditions to the WHERE clause, send the statements to the backend (MapD Core), and update the charts with the new data.
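The filter propagation described above can be sketched in a few lines of JavaScript. This is a minimal illustration and not the Immerse source: the chart objects, the `contributions` table, the `contributor_state` and `amount` columns, and the function names are all hypothetical. Only `recipient_name` and `recipient_party` come from the dashboard example.

```javascript
// Hypothetical chart definitions: each chart is a GROUP BY query over one table.
// Table and column names other than recipient_name / recipient_party are assumptions.
const charts = [
  { id: "recipients", table: "contributions", dimensions: ["recipient_name", "recipient_party"], measure: "SUM(amount)" },
  { id: "byState", table: "contributions", dimensions: ["contributor_state"], measure: "SUM(amount)" },
];

// Quote a string literal for SQL by doubling embedded single quotes.
function sqlString(value) {
  return "'" + String(value).replace(/'/g, "''") + "'";
}

// Build the SQL for one chart, appending every active filter to the WHERE clause.
function buildQuery(chart, filters) {
  const dims = chart.dimensions.join(", ");
  const conditions = filters.map((f) => `${f.column} = ${sqlString(f.value)}`);
  const where = conditions.length ? ` WHERE ${conditions.join(" AND ")}` : "";
  return `SELECT ${dims}, ${chart.measure} AS val FROM ${chart.table}${where} GROUP BY ${dims}`;
}

// Crossfilter: a click on one chart yields filters that apply to every *other* chart.
function crossfilter(allCharts, sourceId, filters) {
  const queries = {};
  for (const chart of allCharts) {
    if (chart.id === sourceId) continue; // the clicked chart keeps its own query
    queries[chart.id] = buildQuery(chart, filters);
  }
  return queries;
}

// Clicking the "Barack Obama (D) / (D)" bar produces two equality filters.
const clicked = [
  { column: "recipient_name", value: "Barack Obama (D)" },
  { column: "recipient_party", value: "D" },
];
const updated = crossfilter(charts, "recipients", clicked);
console.log(updated.byState);
```

Each string in `updated` would then be sent to the backend and the matching chart redrawn with the response. A production version would also need to handle range filters (for brushing) and column-name validation, which this sketch leaves out.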

What I have described above is the underlying logic behind crossfiltering. This simple process allows for unparalleled interactivity when drilling down on data looking for outliers, trends, or anomalies. While other products have implemented a cross-filtered interface in some fashion, they often only allow cross-filtering through a contextual menu, partially in an effort to protect a user from the slowness of the underlying data engine. Immerse needs no layer of indirection since the MapD Core database housing the data can execute scan queries over billions of rows of data in milliseconds.

While one could certainly apply the above logic to any database supporting SQL, of which there are many excellent free and open source implementations, such an approach would not scale. At around the million-row mark, the performance of many databases begins to suffer. At tens of millions of rows, these queries often take too long to run and Immerse charts would cease to be interactive. At a billion or more rows, as in our NYC Taxi demo, these queries would take tens of seconds to minutes to run on most systems. Therefore most data visualization systems to date rely heavily on sampling small portions of the entire dataset or on complex distributed cache systems. While sampling is valuable when drawing broad insights, outliers are often missed with such an approach. For certain datasets, analysis is needed at the most granular level to spot the most valuable trends and long-tail events. Distributed caches are extremely useful for providing rapid updates, but they tend to add another layer of complexity to a technology stack and often become outdated when records in the database change.

Given the MapD Core database query execution speed, we have taken a simple approach with Immerse. Since queries are so cheap, on every interaction we issue a new query, update every chart, and render based on the new data. Any lag in Immerse most often traces to network latency and not database throughput, which cannot be said for most databases on such large datasets. That is not to say that, since our query throughput is unrivaled, we can disregard network latency. On dashboards with 10+ charts the number of queries can become expensive, especially over very large datasets and with certain interactions, such as brushing on a line or histogram chart (shown below), that require multiple updates for each chart. To avoid reissuing previously seen queries, we have implemented a simple caching layer as a filter on each network request. One of the nice aspects of dealing directly with SQL is that the cache can be a simple key/value map in which the query string is the key and the response is the value. The cache holds the most seen queries, and clearing it is simple since we are only dealing with a plain JavaScript object.
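A query-string cache of this kind might look roughly like the sketch below. It is an illustration under stated assumptions and not the Immerse implementation: `sendQuery` is a hypothetical stand-in for the network call to the backend, and the size cap with least-seen eviction is one possible reading of "holds the most seen queries".

```javascript
// Wrap a query function with a cache keyed by the SQL string.
// `sendQuery` is a hypothetical stand-in for the request to the backend;
// whatever it returns (in practice, likely a Promise) is stored as the value.
function createQueryCache(sendQuery, maxEntries = 100) {
  let entries = Object.create(null); // SQL string -> { response, hits }
  let size = 0;

  // Assumption: when full, drop the entry that has been seen the fewest times.
  function evictLeastSeen() {
    let victim = null;
    for (const key in entries) {
      if (victim === null || entries[key].hits < entries[victim].hits) victim = key;
    }
    if (victim !== null) {
      delete entries[victim];
      size -= 1;
    }
  }

  function query(sql) {
    const hit = entries[sql];
    if (hit) {
      hit.hits += 1;
      return hit.response; // previously seen query: no network request
    }
    if (size >= maxEntries) evictLeastSeen();
    const response = sendQuery(sql);
    entries[sql] = { response, hits: 1 };
    size += 1;
    return response;
  }

  // Clearing is just replacing the object.
  function clear() {
    entries = Object.create(null);
    size = 0;
  }

  return { query, clear };
}

// Usage with a stub backend that counts how many queries actually go out.
let sent = 0;
const cached = createQueryCache((sql) => {
  sent += 1;
  return `rows for: ${sql}`;
});
cached.query("SELECT COUNT(*) FROM contributions");
cached.query("SELECT COUNT(*) FROM contributions"); // served from the cache
console.log(sent);
```

Because the key is the full SQL text, two charts that happen to issue the same statement share one cached response. The trade-off is that the cache must be cleared whenever the underlying data changes, since nothing in the key reflects the state of the table.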

Immerse stack

If you’re curious about the Immerse stack: we use React to manage layout and chart updates, Redux to hold application and chart state, and have built on DC.js (which uses d3.js) for the charts. If you’d like to know more about Immerse and crossfiltering, have a look at some of our demos. Click around on some of the chart elements, zoom in or out on the point map, drag your cursor over the line and histogram charts, and watch the other charts on screen update in real time.

