Veda Shankar
Jun 4, 2018

Bitcoin Transactions: From BigQuery to MapD

Try HeavyIQ Conversational Analytics on 400 million tweets

Download HEAVY.AI Free, a full-featured version available for use at no cost.

GET FREE LICENSE

Bitcoin, cryptocurrency, and blockchain technology were hot topics in many 2017 water cooler chats, especially after the meteoric rise in Bitcoin prices late last year. Despite Bitcoin’s popularity, very few of us, myself included, understand the inner workings of an actual Bitcoin transaction and how this growing public ledger is maintained.

But wait, I work for MapD - the most powerful visual analytics engine on the planet - and extremely large data sets are perfect fodder for MapD. So, I ingested the raw Bitcoin transactions into the MapD Core analytics database, along with other publicly available market data. Now, we can explore the correlations between the different datasets and visualize them in MapD Immerse visual analytics.

Later in this blog, I show how with a few clicks I was able to visually drill down into the transactions during the wild ride of late 2017 through early 2018, where I find all the million dollar transactions that involve the top 100 popular bitcoin addresses.

Steps

In this post, I detail the following steps:

  1. Source the raw data from Google BiqQuery and copy it to a bucket in Google Cloud Platform
  2. Source related market data from Quandl, a marketplace for financial, economic and alternative data, and download it locally
  3. Give MapD Server access to the cloud bucket using gsutil
  4. Using a Python application, merge the Bitcoin transaction data with other pertinent data and ingest the data into MapD Core
  5. Use MapD Immerse to chart and explore the data together

Bitcoin Data on BigQuery

There are many ways to download Bitcoin blockchain to examine the transactions. You can set up one of these Bitcoin clients which will download the entire blockchain, and you can extract the individual transactions from that. Setting up a Bitcoin client is resource intensive from a bandwidth, storage and computing perspective. So I used Google's BigQuery, which I find to be the easiest way to get all of the Bitcoin transactions with the least amount of setup time. Of course, you need an account on Google and access to Google Cloud Platform. Bitcoin blockchain transaction data is now available on BigQuery's bigquery-public-data:bitcoin_blockchain dataset and is updated every 10 minutes.

Before we query and export the data from BigQuery, you need to create a Bucket on Google Cloud Platform Storage to store all the transaction objects. Go to Google Cloud Platform dashboard and select Storage.

 Google Cloud Platform-Storage

On the Storage dashboard, select CREATE BUCKET and give a unique name to the bucket.

Create Bucket

Make sure that the newly created bucket appears on the bucket list on the storage browser page. Now you are ready to use this bucket to download the Bitcoin transactions.

Go back to Google BigQuery and select bigquery-public-data -> bitcoin_blockchain -> transactions.

bigquery-public-data

Use the Export Table option to export all the bitcoin transactions to the newly created cloud storage bucket. Notice that we have selected the export format to be JSON and enabled compression.

Export table

Click on Job History to monitor the progress of the export task. The task will run for couple of minutes and report success with a green check mark (see below).

Job History

You should also click on the bucket under the Storage Browser and see thousands of objects which are gzipped Bitcoin transactions.

Storage Browser

Bitcoin Data on QUANDL

The Bitcoin blockchain transaction information you downloaded from BigQuery contains details on each transaction like the date, the amount of Bitcoin value that is transferred from one Bitcoin address to another, block information, etc. You will find detailed information about all the fields that comprise a transaction under Table Details: transactions -> Schema.

This raw data from each transaction on its own is not very interesting if you are not mining or validating the transaction. If we combine this raw transactional information with market data on Bitcoin then we can see interesting patterns and highlight trends on this fascinating financial instrument.

Quandl is a platform for financial, economic and alternative data that serves investment professionals. Quandl sources data on Bitcoin Blockchain and updates on various interesting Bitcoin related metrics on a daily basis. The data can be downloaded as a CSV file for each individual metric. You can select whatever metrics interest you; I downloaded the following metrics:

  • BCHAIN/MKPRU Bitcoin Market Price in US Dollar
  • BCHAIN/ETRVU Bitcoin Estimated Transaction Volume in US Dollar
  • BCHAIN/TRFUS Bitcoin value of transaction fees miners earn per day in US Dollar
  • BCHAIN/NTRAT Total Number of Unique Transactions per day
  • BCHAIN/NADDU Number of unique Bitcoin addresses used per day
  • BCHAIN/NTREP Number of Transactions Excluding Popular Addresses
  • BCHAIN/DIFF Bitcoin Difficulty

Each of these metrics is updated daily and can be downloaded as a CSV file, which has two fields - date and metric value. I joined these CSV files into a single CSV file (btc_quandl.csv) using the date field as a key. I also downloaded the 100 most popular Bitcoin addresses into the CSV file popular_btc_addresses.csv.

gsutil - Accessing Google Cloud Storage

gsutil is a tool that enables you to access Google Cloud Storage from the command-line. gsutil is part of the Google Cloud SDK, but I cover the standalone installation of gsutil because it’s the only tool required to access the cloud storage. For detailed instructions on installing Google Cloud SDK/gsutil, please refer to the website Install gsutil.

Use gsutil to access the Bitcoin transaction data that you downloaded to your Cloud Storage bucket. Install gsutil on the MapD server if you have shell access so that the data can be ingested directly from Cloud Storage into MapD table after preprocessing without involving any intermediary server. If you don’t have shell access, gsutil and the preprocessing scripts can be installed on any Linux server for staging the data before uploading it to MapD.

  1. Download gsutil as gsutil.tar.gz
  2. Extract the files into your home directorytar xfz gsutil.tar.gz -C $HOME
  3. You will notice that a gsutil directory is created which contains the tool and associated files.
  4. Add gsutil to your PATH environment variable exportPATH=${PATH}:$HOME/gsutil
  5. To authenticate gsutil to have access to your cloud storage, typegsutil config
  6. You will be prompted with a URL for the authorization page where you can authorize gsutil to access Cloud Storage. The gsutil dialog looks similar to the following:
gsutil dialog
  1. Copy and paste the URL into a browser window. If prompted, sign in with the Google account associated with your Cloud Storage data in the browser window.
  2. Click the Allow Access button. This authorizes gsutil to access Cloud Storage.
gsutil authentication
  1. Copy the authorization code that appears on the next page into the gsutil prompt and press Enter. After you provide an authorization code, gsutil prints the following:
gsutil prints the following
  1. Log in to the Google Cloud Platform Console to find a project ID. Find the ID by clicking Manage all projects found in the drop-down in the upper right and enter it at the gsutil prompt.
  2. Confirm that you can now access your cloud storage by typing:gsutil ls

Workflow for ingesting Bitcoin transactions into MapD

The list below captures the workflow for transferring from Storage Cloud each object containing the Bitcoin transactions in compressed JSON format to a CSV format that can be ingested into a MapD table. The application has been written in Python and you can download the code from its github repo.

The Python application performs the following steps for each object in the Storage Cloud bucket:

  • Copy Object from Storage Cloud to chunk.json.gz
  • Gunzip chunk.json.gz
  • Execute btc_json2csv() to extract transactions from JSON format to CSV format
  • Created btc_chunk.csv
  • Read btc_chunk.csv and btc_quandl.csv (merged tables of all BTC metrics) into Pandas data frames
  • Perform an inner merge on the two data frames using the transaction_date field
  • Insert columns input_address_type & output_address_type
  • Set the address type to either ordinary or popular based on whether the Bitcoin address in the transaction belong to the Top 100 popular Bitcoin address list (popular_btc_addresses.csv)
  • Write out the data frame to a CSV file btc_chunk_final.csv
  • Connect to MapD
  • Create MapD table if it does not exist
  • Using COPY FROM command load data into MapD table
  • Close connection to MapD

A MapD table is created with the following columns:

Column Name

Data Type

Column Name

Data Type

transaction_date

DATE

block_id

TEXT ENCODING DICT(32)

transaction_fees_usd

FLOAT

prev_block_id

TEXT ENCODING DICT(32)

transaction_volume_usd

FLOAT

input_sequence

BIGINT

btc_difficulty

FLOAT

input_pubkey

TEXT ENCODING DICT(32)

btc_price_usd

FLOAT

output_pubkey

TEXT ENCODING DICT(32)

btc_addresses_used

FLOAT

output_satoshis

FLOAT

transactions_exclude_popular

FLOAT

input_address_type

TEXT ENCODING DICT(32)

total_transactions

FLOAT

output_address_type

TEXT ENCODING DICT(32)

transaction_datetime

TIMESTAMP

output_btc

FLOAT

transaction_id

TEXT ENCODING DICT(32)


Creating Charts in MapD Immerse

With the Bitcoin transaction data now loaded into MapD, we are ready to explore it! Connect to Immerse at port 9092 on the MapD server instance with your browser. Using MapD Immerse you can visually interact and analyze the vast trove of raw Bitcoin transactional data and discover interesting facts, which may be hard to find in extremely large datasets. Immerse provides instantaneous representations of your data, from basic charts to rich and complex visualizations. Charts are comprised of measures and dimensions. Measures are quantitative, numerical values or aggregated values (SUM, AVG, etc). Dimensions categorize measures and are used for grouping of data. A detailed explanation of the different chart types and Immerse features can be found in the Immerse User Guide.

I found the Combo Chart feature of Immerse most insightful while creating the dashboard for the Bitcoin data. I used the transaction date for the dimension, which is represented on the X-Axis, and selected the following measures which are rendered on the Y-Axis:

  • Average number of unique Bitcoin addresses used per day
  • Average number of transactions per day excluding popular addresses
  • Average total number of unique transactions per day

As the total number of transactions ranges in the 100s of million per day, I moved the other measures to a secondary Y-Axis. Also, the total number of transactions is represented as a vertical bar chart while the other measures use a line chart for clarity.

Combo Chart - Bitcoin Transactions

Combo Chart - Bitcoin Transactions

I use combo charts to show a correlation between related metrics in a Bitcoin transaction. One chart plots the price of Bitcoin in US dollars versus the Bitcoin transaction volume in US dollars. I use another chart to show the total transactions occurring daily alongside the number of transactions per day that don't include the popular 100 Bitcoin addresses. A third chart shows the Bitcoin mining (computational) difficulty in finding a new block for transactions, alongside the amount earned by the miners as part of the transaction fees. While the difficulty to mine each blockchain block keeps increasing, the miners' transaction fees are related to the demand of Bitcoin, i.e., its rise in price as seen on the first chart towards the end of 2017 when there was a bidding war over the block space.

Combo Chart - Bitcoin Difficulty and Miners Transaction Fees

Combo Chart - Bitcoin Difficulty

I use bar charts to show the number of input or output Bitcoin address that is classified as ordinary or popular, i.e., belonging to the top 100 Bitcoin addresses. To see the input addresses with most records and transaction IDs with most records, created a bar chart for each. The bars are colored based on the sum of the output Bitcoin that was transacted. I can click around on some of the chart elements or brush across a time range, and watch the other charts on the screen update instantaneously. This feature of MapD Immerse is called crossfilter, which allows a filter applied to one chart to simultaneously be applied to the rest of the charts on a dashboard. Crossfilter is possible even with massive data sets, because MapD is not pre-indexing or aggregating any of the data. Let me click on the popular input Bitcoin addresses and see how the million dollar transactions have played out during the wild ride of 2017.

Crossfilter in action!

Crossfilter in action!

What's Next

To tell the Bitcoin story in a visual and easy to understand format, we worked with a team of bright students from Berkeley Innovation, who came up with a prototype for visualizing the Bitcoin blockchain. Based on this prototype, we are working on a MapD application that will extend the above methodology to stream the Bitcoin transactions continuously from BigQuery into MapD database tables. The MapD Bitcoin application will be accessible from any browser to visually analyze the Bitcoin data and gain insights into world's hottest financial instrument. Join t tuned.

Veda Shankar

Veda Shankar is a Developer Advocate at HEAVY.AI working actively to assist the user community to take advantage of HEAVY.AI's open source analytics platform. He is a customer oriented IT specialist with a unique combination of experience in product development, marketing and sales engineering. Prior to HEAVY.AI, Veda worked on various open source software defined data center products at Red Hat.