Data science on OmniSci - 2020 recap, and new DS tools for Mac
NOTE: OMNISCI IS NOW HEAVY.AI
OmniSci on the Mac
Back in July of what seemed like a very long 2020, we took a small detour from the normal OmniSci product path to launch an experimental preview version of OmniSci for Mac. We first set out to showcase our results on Mark Litwintschik’s popular analytic SQL benchmark. Along the way, we realized we had a chance to let users try out the complete OmniSci stack (minus some key GPU-dependent features) on what is still the most popular laptop among developers.
Since then, we have updated the OmniSci for Mac Preview with each new OmniSci release (and will continue to do so!). We haven’t been hiding under a rock either: like everyone else, we were super excited by Apple’s new lineup of Macs based on M1 silicon. Trust us, the ‘Mac experiment’ will continue, and we’ll keep you posted on more M1-related OmniSci news in the new year!
Meanwhile, here’s what you can do today with the Mac Preview. I loaded 605 million events (that’s every GitHub event in 2019) from the GitHub archive dataset on my 16” 2019 MacBook Pro (64GB DRAM, 8-core Intel i9). It’s always fun to see how far we can scale down the infrastructure while keeping our trademark interactivity at scale intact.
Data Science with OmniSci - what we did in 2020
On another front, it has been over a year since we launched our data science tools. Our goal at the time was to integrate two normally separate perspectives: the creator, typically a data scientist or data analyst, who wants to understand data deeply in order to build data products, and the consumer of those data products, typically a business user.
OmniSci’s biggest customers are the latter group. They love Immerse, our powerful interactive data exploration UI powered by OmniSciDB. At the same time, we always wanted to provide an identical experience for the first group, the data scientists - to accelerate their workflows without forcing them to learn a new ecosystem or environment. More broadly, our goal is to make these workflows as effortless and fast as possible.
OmniSci Data Science Tools for the Mac
We will cover all the work we did in 2020 shortly, but first we’re announcing a new add-on to the Mac Preview - OmniSci Data Science Tools.
Our partners at Quansight helped package all the Python-based tooling we’ve worked on so far into a single self-contained installer script, now available to download alongside OmniSci for Mac. The installer checks whether you have an existing conda installation, and either updates it or installs conda for you. You get a fully self-contained OmniSci data science conda environment on your Mac laptop that includes Ibis, Altair, HoloViews, our Python connectors, JupyterLab and many other tools.
You can get going with example notebooks right away. As in our enterprise edition, you can launch JupyterLab directly from within Immerse, and connect to the local OmniSciDB instance used by Immerse - no command line needed!
Meanwhile, we’re making progress on a Windows-native version of OmniSci too, and will have more to share in the new year.
Ibis, Altair and OmniSci - Declarative, Interactive visualizations at scale in Python
Our earliest work on integrating data science tools started with our pymapd connector, which included methods to output query results to pandas dataframes, as well as to Arrow buffers in CPU and GPU memory. The latter capability allows us to output directly to cuDF, for example. Building on this, Quansight helped integrate a familiar API for data exploration based on Ibis, and then paired it with the excellent Altair library, allowing data scientists to interactively explore very large datasets outside of Immerse, in JupyterLab, as well.
Over the last year, we polished the Ibis/Altair integration significantly, both in performance and capability. Here’s an example of what you can do with the same 605M row dataset I showcased earlier.
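To make the connector path a little more concrete, here is a minimal sketch of pulling query results into a dataframe with pymapd. The table name `github_events` and the column `event_type` are hypothetical, and the connection settings assume a local OmniSciDB instance that still uses the default credentials; adjust them for your own setup.

```python
def top_event_types_sql(table="github_events", limit=10):
    """Build a query counting events per type (hypothetical table and column)."""
    return (
        f"SELECT event_type, COUNT(*) AS n FROM {table} "
        f"GROUP BY event_type ORDER BY n DESC LIMIT {int(limit)}"
    )


def connect_local():
    """Connect to a local OmniSciDB instance (assumes default credentials)."""
    import pymapd  # imported lazily so the other helpers work without pymapd

    return pymapd.connect(
        user="admin",
        password="HyperInteractive",
        host="localhost",
        dbname="omnisci",
    )


def fetch_dataframe(con, sql, gpu=False):
    """Run `sql` and return the result as a dataframe.

    select_ipc returns a pandas DataFrame built from an Arrow buffer in CPU
    shared memory; select_ipc_gpu returns a cuDF DataFrame in GPU memory.
    """
    return con.select_ipc_gpu(sql) if gpu else con.select_ipc(sql)
```

With a server running, something like `fetch_dataframe(connect_local(), top_event_types_sql())` should hand back a pandas dataframe. Passing `gpu=True` only makes sense on a machine with a supported GPU and cuDF installed, so it does not apply to the Mac Preview.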
Interactivity at scale, Part Deux - HoloViews, Ibis and OmniSci
This year, our collaborators at Quansight helped add an Ibis backend for the excellent HoloViews project. You get another complete data visualization stack that can be powered by OmniSci via Ibis. This works with every other supported Ibis backend too, including Google BigQuery, Apache Spark and Postgres, with geospatial capabilities where the backend supports them. Like Altair, HoloViews supports interactive, large-scale data exploration, and it gives you a choice of multiple pyviz plotting libraries - Bokeh, Plotly and Matplotlib. You also get other pyviz tools like Panel to build complete, parametrized dashboards.
We’re going to deep dive into this soon, but here’s a quick example of using the Ibis/HoloViews combination with Bokeh for graph visualization. Of course I’m running it on my laptop! Over the next couple of weeks I’ll do a blog post on how to get to the visualization below from the raw GitHub dataset we showed earlier.
Modin - Pandas powered by OmniSci
While Ibis is a terrific, powerful and really underrated library in the PyData ecosystem, it isn’t really meant to be a drop-in pandas substitute. That’s because it focuses (by design) on building and evaluating analytic expressions, not the full dataframe lifecycle or API, particularly when it comes to large scale data manipulation. Data scientists who don’t necessarily want to manage OmniSci as a separate component in their workflow sometimes need the full API surface of pandas, particularly during data shaping and ingestion.
Luckily there are many promising candidates to bridge the gap between the flexibility of pandas and its relatively limited scalability. Modin, a subproject of the excellent, deservedly fast-growing Ray toolkit for distributed computing, is one such alternative. It aims to provide a drop-in (but also scalable and performant) replacement for pandas that can use either Ray or Dask for distributed execution.
At Converge 2019, Intel approached us about their idea to adapt the OmniSci analytic execution engine as a transparent but scalable backend for Modin. We started to work with them, and the excellent Intel team, including Devin Petersohn and Areg Melik-Adamyan, dived in to build a dataframe wrapper around OmniSciDB, with the goal of leveraging OmniSci’s high-performance execution engine while supporting a broader set of dataframe operations.
This means you can now get the full pandas API with all the scale and power of OmniSci. And, just like with Ibis, no SQL knowledge is needed! We’re working with Intel to make this publicly available soon via their conda channel.
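To illustrate what “drop-in” means in practice, here is a small sketch that swaps the import and leaves the rest of the code untouched. It shows only Modin’s standard import swap, not the OmniSci-backed engine discussed in this section, and the columns `event_type` and `n` are made up for the example. If Modin isn’t installed, the sketch falls back to plain pandas so that it still runs.

```python
# Prefer Modin when it is installed; otherwise fall back to plain pandas.
# Either way, the rest of the script uses the same `pd` API.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd


def events_per_type(frame):
    """Total the `n` column per `event_type`, largest first (hypothetical columns)."""
    return frame.groupby("event_type")["n"].sum().sort_values(ascending=False)


# Tiny made-up sample standing in for a real event table.
sample = pd.DataFrame({
    "event_type": ["PushEvent", "WatchEvent", "PushEvent", "ForkEvent"],
    "n": [3, 1, 2, 4],
})
totals = events_per_type(sample)
```

The point of the design is that `events_per_type` never needs to know which library is behind `pd`, so the execution engine can change without touching analysis code.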
Meanwhile, we’re really excited that OmniSci is now part of the Intel AI Analytics Toolkit within the oneAPI software foundation. We’ll be publishing a blog about this soon, so stay tuned.
OmniSci and Apache Arrow - getting better all the time
On the ingest side of the workflow, Intel also helped us build a low-overhead, high-performance Arrow-based ingest path. This means it will soon be possible to build efficient, performant connectors from OmniSci to other solutions that leverage Arrow for data interchange. Stay tuned for more announcements here.
We’re really excited about these Arrow-related foundational developments. We’re working on exposing these through as many of our upstream and downstream interfaces as possible, starting with OmniSci 5.5. In addition, we’re looking at further improvements around making the Arrow transport more efficient.
In the works...
On top of all this, we also continue to do some foundational work on truly high-performance User-Defined Functions and User-Defined Table Functions capable of leveraging C++, Python and Numba. This foundation will allow us to integrate external libraries for Machine Learning much more closely with the core OmniSci execution path and embed ML/AI natively into Immerse data visualization.
Here’s a little teaser of what’s possible, leveraging a new remote compiler infrastructure that Quansight built for LLVM. You can write code in Python and have it execute natively on both CPU and GPU as an OmniSci UDF, and then use these UDFs in SQL, all in a single seamless workflow. The Remote Backend Compiler is included as part of the data science tools, and we will be publishing more extensive documentation soon. Here is an example of the complete workflow in action.
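For a flavor of the Python side of that workflow, here is a minimal sketch, not the full documented workflow. It assumes the `rbc` package as shipped around this release, where the entry point is `rbc.omniscidb.RemoteOmnisci` (later releases may use different names), a local server that accepts runtime UDFs, and the default credentials. The conversion function and the `weather` table mentioned afterwards are purely illustrative.

```python
def fahrenheit_to_celsius(f):
    """Plain Python function to be compiled into a scalar UDF."""
    return (f - 32.0) * 5.0 / 9.0


def register_udf(host="localhost", port=6274):
    """Register the function above as a UDF on a running OmniSciDB server.

    Assumes the server was started with runtime UDFs enabled and still
    uses the default admin credentials.
    """
    from rbc.omniscidb import RemoteOmnisci  # lazy import; needs the rbc package

    omnisci = RemoteOmnisci(
        user="admin",
        password="HyperInteractive",
        host=host,
        port=port,
        dbname="omnisci",
    )
    # The signature string declares a double -> double scalar function.
    omnisci("double(double)")(fahrenheit_to_celsius)
    omnisci.register()
    return omnisci
```

Once registered, the function should be callable from SQL like any built-in, for example `SELECT fahrenheit_to_celsius(temp_f) FROM weather` against a hypothetical `weather` table.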
We hope you have fun with data science on your Mac with OmniSci - we will dive deeper into all these new tools over the next few weeks with a series of blog posts illustrating what you can do with them! Meanwhile, download the installer today, check out our Data Science Platform, and let us know what you think on our community forums.