Randy Zwitch
Jul 30, 2019

Announcing OmniSci.jl: A Julia Client for OmniSci

For a more in-depth presentation about calling OmniSci from Julia, see my JuliaCon 2019 talk.

Today, I’m pleased to announce a new way to work with the OmniSci platform: OmniSci.jl, a Julia client for OmniSci! This Apache Thrift-based client is the result of a passion project I started when I arrived at OmniSci in March 2018 to complement our other open-source libraries for accessing data: pymapd, mapd-connector, and JDBC.  

Julia and OmniSci: Similar in Spirit and Outcomes

If you’re not familiar with Julia, it is a dynamically typed, just-in-time compiled language built on LLVM that can match, and sometimes beat, the performance of compiled languages such as C/C++ and Fortran. With the performance of C++ and the convenience of writing Python, Julia quickly became my favorite programming language when I started using it around 2013-2014.

In many ways, Julia and OmniSci share a lot of interesting similarities, both in the technical underpinnings and in ethos. As mentioned above, Julia is built upon LLVM to just-in-time compile user code; OmniSci uses LLVM to compile user queries. By compiling the code after the user submits it, users get a dramatic speed-up upon repeated use of the same functions/queries while still retaining the ability to interactively work with data (as opposed to the write-compile-run workflow of compiled languages).
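The compile-on-first-use behavior is easy to see from the Julia side. The sketch below times the same call twice; `sum_of_squares` is an illustrative function, not part of any package, and the timings depend on your machine and Julia version, so none are shown here.

```julia
# A small numeric kernel; nothing is compiled until the first call
function sum_of_squares(xs)
    total = zero(eltype(xs))
    for x in xs
        total += x * x
    end
    return total
end

xs = rand(1_000_000)

# The first call includes JIT compilation for Vector{Float64}
first_call = @elapsed sum_of_squares(xs)

# The second call reuses the already-compiled method
second_call = @elapsed sum_of_squares(xs)

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

The second timing is typically much smaller than the first, which is the same effect you see when re-running a query that OmniSci has already compiled.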

This focus on interactivity AND performance is a key part of both user communities as well. There are many amazing programming languages and databases that provide one or the other, but the goal here is to have both, as stated in Julia’s “greedy” introduction to the world:

“We want the speed of C with the dynamism of Ruby….We want something as usable for general programming as Python, as easy for statistics as R...Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.”

For OmniSci, it’s not enough to be able to render a choropleth every 5 seconds; we want to render 300 choropleths in that time. Running a billion-row query in 4 seconds might be OK depending on your use case, but running that same billion-row query in 140ms will change how you think about your data and the questions you can ask. With that kind of performance in mind, I started creating OmniSci.jl, so that after I was done querying an OmniSci database I could continue my work in a similarly fast environment.

Getting Started With OmniSci.jl

Installing OmniSci.jl

OmniSci.jl is hosted on GitHub and part of the Julia General registry, so installing the package is the same as any other Julia package:

julia> import Pkg; Pkg.add("OmniSci")

Creating an OmniSciDB instance

The easiest way to set up an open-source OmniSciDB instance is to use one of our Docker containers (GPU / CPU). To start the GPU-enabled container, you need to install nvidia-docker2, then run the following command:

     docker run \
       --runtime=nvidia \
       -d \
       --name omnisci \
       -p 6274:6274 \
       -v /home/username/omnisci-storage:/omnisci-storage \
       omnisci/core-os-cuda

Connecting to OmniSci and Running Queries

Suppose you wanted to evaluate the probability of a US Airways flight departing late from Philadelphia. To calculate this, I’m using the flights_2008_7M table, which is provided as one of several example dataset choices in OmniSci (see here for instructions to install the OmniSci example datasets if you do not already have them).

To estimate the probability of a flight being late, you might calculate the following:
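A minimal sketch of such a calculation, assuming a local OmniSciDB on port 6274 with placeholder credentials, and assuming the table exposes the columns `uniquecarrier`, `origin`, `depdelay`, and `dep_timestamp` (check your own schema). Treating any positive `depdelay` as "late" is also an assumption, and `fetch_departure_delays` and `late_probability` are illustrative helper names, so the result will not have the exact shape discussed below.

```julia
using OmniSci  # assumed installed via Pkg.add("OmniSci")

# Daily counts of US Airways departures from Philadelphia.
# Column names and the "late" rule (depdelay > 0) are assumptions.
const LATE_DEPARTURES_SQL = """
    SELECT
        CAST(dep_timestamp AS DATE) AS flight_date,
        COUNT(*) AS total_flights,
        SUM(CASE WHEN depdelay > 0 THEN 1 ELSE 0 END) AS late_flights
    FROM flights_2008_7M
    WHERE uniquecarrier = 'US'
      AND origin = 'PHL'
    GROUP BY flight_date
    ORDER BY flight_date
"""

# Pure-Julia helper: overall share of late departures from per-day counts
late_probability(late, total) = sum(late) / sum(total)

# Run the query on an open connection and return the per-day result
function fetch_departure_delays(conn)
    return sql_execute(conn, LATE_DEPARTURES_SQL)
end

# With a running server (the credentials below are placeholders):
# conn = connect("localhost", 6274, "admin", "HyperInteractive", "omnisci")
# departure_delay_by_day = fetch_departure_delays(conn)
# late_probability(departure_delay_by_day.late_flights,
#                  departure_delay_by_day.total_flights)
```

Keeping the aggregation in SQL means only one row per day crosses the wire, and the final division is a one-liner in Julia.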

At 7 million records, this dataset would hardly stress any relational database, nor would returning a 368x8 DataFrame of results. Had this example used the full billion-row flights dataset, the power of OmniSciDB with GPU acceleration would be much clearer. However, calling eltypes(departure_delay_by_day) reveals one powerful feature of OmniSci.jl: the exact types of the OmniSci columns are reflected within Julia. A second example will make this even clearer:
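A sketch of what such an example could look like, again assuming a local connection with placeholder credentials and that the `omnisci_states` example table has been loaded. `column_types` and `state_column_types` are hypothetical helpers; `column_types` uses `eltype.(eachcol(df))`, which gives the same information as `eltypes(df)` on DataFrames.jl releases where `eltypes` is no longer available.

```julia
using OmniSci, DataFrames  # both assumed installed

# Map each column name to the element type of that column
function column_types(df::AbstractDataFrame)
    return Dict(zip(string.(names(df)), eltype.(eachcol(df))))
end

# Pull the example table and report the Julia type of every column
function state_column_types(conn)
    states = sql_execute(conn, "SELECT * FROM omnisci_states")
    return column_types(states)
end

# With a running server (the credentials below are placeholders):
# conn = connect("localhost", 6274, "admin", "HyperInteractive", "omnisci")
# state_column_types(conn)
```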

In this example, the `omnisci_states` table contains a MultiPolygon data type, which is the Julia representation of a geospatial type from the GeoInterface.jl library. Uploading data works similarly to downloading; if OmniSci supports a data type (e.g. various integer widths, float, double, string, geospatial), then uploading the Julia version of that data type using OmniSci.jl gives you the same type inside OmniSci. No more mental gymnastics serializing and de-serializing data between tools!

Future Avenues for Collaboration

If exact typing between Julia and OmniSci were the only benefit to this package, it wouldn’t be worth an announcement. As it stands now, OmniSci.jl hasn’t reached feature parity with our Python client pymapd (which I also maintain). Pymapd has the benefit of a larger data science ecosystem in Python and industry support through the RAPIDS project for seamless data transfer on GPU using Apache Arrow. Julia, by way of being a younger language, has neither the community size nor maturity of packages (yet!).

What excites me about OmniSci.jl is how it opens up the opportunity for collaboration in the bigger Julia community around GPU analytics. In my JuliaCon 2019 talk (slides), I highlight four areas where Julia and OmniSci could be a killer combination:

  • Runtime User-Defined Functions: a beginning prototype of this functionality is being tested for Python, using Numba to emit LLVM IR for registering with OmniSci. Given Julia is built on LLVM and the @code_llvm macro can emit the LLVM IR for a Julia function, it should be possible to replicate the Python UDF functionality without succumbing to the “Two-language problem”. Is CUDANative.jl part of this solution?
  • GPU DataFrame for Julia: cudf for Python defines an interface and implements a GPU DataFrame. Should Julia try to adopt this standard? Or are Vectors of NamedTuples on the GPU “enough”?
  • IPC for Shared Memory Transfer: this is an area where pymapd currently has more functionality than OmniSci.jl, due to more mature libraries. But zero-copy data transfer from OmniSci to Julia via Apache Arrow is possible, creating an even higher-performance solution.
  • Passing GPU buffers directly to Makie for visualization: OmniSci can already pass pointers to a GPU DataFrame; what scale of visualization would be possible if Makie communicated directly with OmniSci?
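On the first point, here is a small sketch of getting LLVM IR out of a Julia function as a string, which is the raw material a runtime UDF would need. The `fahrenheit` function and the `llvm_ir` helper are made up for illustration, and registering the IR with OmniSci is not shown; whether Julia’s emitted IR can be registered as-is remains an open question.

```julia
using InteractiveUtils  # provides code_llvm and @code_llvm outside the REPL

# Hypothetical scalar function one might want to expose as a UDF
fahrenheit(celsius::Float64) = celsius * 1.8 + 32.0

# Capture the LLVM IR for f(argtypes...) as a String instead of printing it
function llvm_ir(f, argtypes)
    io = IOBuffer()
    code_llvm(io, f, argtypes; debuginfo = :none)
    return String(take!(io))
end

ir = llvm_ir(fahrenheit, (Float64,))
println(ir)
```

The function-style `code_llvm` is used here rather than the `@code_llvm` macro because it accepts an IO argument, which makes the IR easy to hand to another tool.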

These are just a few areas where I think OmniSci and Julia could be a great pairing, and I’m certain the scientists and engineers in the community have some even better ideas. Regardless of where your interests lie, if having open-source relational database functionality on the GPU via Julia sounds interesting to you, I’d love to collaborate! Post on the Julia message board, the OmniSci forum, or OmniSci.jl issues, catch me on Twitter, or get in touch any other way; I’d love to talk with you about your ideas for moving GPU analytics forward in Julia.

Special thanks to Tanmay Mohapatra, whose Thrift.jl package makes OmniSci.jl possible.

Randy Zwitch

Randy Zwitch is a Senior Director of Community at HEAVY.AI, enabling customers and community users alike to utilize HEAVY.AI to its fullest potential. With broad industry experience in Energy, Digital Analytics, Banking, Telecommunications and Media, Randy brings a wealth of knowledge across verticals as well as an in-depth knowledge of open-source tools for analytics.