Data Bloom

Store Massive Datasets in the Data Lake and Build Self-Service BI

IT, the key custodian of data, maintains databases, data warehouses, data marts, and data lakes to serve enterprise functions such as finance, human resources, sales and marketing, supply chain management, and product development. Yet business units routinely fault IT for being too slow.

"It's my data, let me have it my way!"

Below, we show how Spark and Hive data lakes make self-service BI (even on an individual laptop) a breeze. Anyone can go from data acquisition to rich, interactive, self-service business intelligence dashboards in under 10 minutes. No middlemen have to come between business users and their business intelligence solutions. IT must empower business units by delegating BI, not encumber them.

For this quick demo, we will use a sample sales dataset from the Tableau website. First, initialize the data lake -- specifically the Hive data lake -- for creating and managing persistent tables from Spark.

In [2]:
import findspark
findspark.init()

from pyspark import SparkContext
from pyspark.sql import HiveContext

# Initialize a HiveContext so managed tables persist across sessions
# The session writing the data (this notebook) is different from the ODBC session that later serves the BI platform
if 'sc' not in vars():
    sc = SparkContext()
if 'hiveCtx' not in vars():
    hiveCtx = HiveContext(sc)

Ingest the dataset, infer its schema, and store it in the lake. Again, make sure to create it in the Hive context.

In [2]:
# Some preliminary imports
import re
import pandas as pd
import pyspark.sql.functions as F

# Read Excel file from the web and ingest
opportunities = hiveCtx.createDataFrame(
    pd.read_excel(
        'https://community.tableau.com/servlet/JiveServlet/downloadBody/1236-102-1-1149/Sample%20-%20Superstore%20Sales%20(Excel).xls'
    ),
    samplingRatio=1.0)

# Sanitize column names (replace non-alphanumeric runs with underscores) so they are Hive-friendly
opportunities = opportunities.select([
    F.col(col).alias(re.sub('[^0-9a-zA-Z]+', '_', col))
    for col in opportunities.columns
])

# Create a temporary table in the Spark catalog first
opportunities.createOrReplaceTempView("opportunity_tbl")

# Purge any prior tables in Hive to refresh with latest data
hiveCtx.sql("drop table if exists default.sales_sample")
# Register temporary table from Spark as a managed table in the Hive warehouse
hiveCtx.table('opportunity_tbl').write.saveAsTable(
    'default.sales_sample', mode='overwrite')
In [3]:
# Show a preview of the data from the Hive data lake
opportunities.limit(5).toPandas()
Out[3]:
Row_ID Order_ID Order_Date Order_Priority ... Product_Name Product_Container Product_Base_Margin Ship_Date
0 1 3 1286928000000000000 Low ... Eldon Base for stackable storage shelf, platinum Large Box 0.80 1287532800000000000
1 49 293 1349049600000000000 High ... 1.7 Cubic Foot Compact "Cube" Office Refrigera... Jumbo Drum 0.58 1349136000000000000
2 50 293 1349049600000000000 High ... Cardinal Slant-D® Ring Binder, Heavy Gauge Vinyl Small Box 0.39 1349222400000000000
3 80 483 1310256000000000000 High ... R380 Small Box 0.58 1310428800000000000
4 85 515 1282953600000000000 Not Specified ... Holmes HEPA Air Purifier Medium Box 0.50 1283126400000000000

5 rows × 21 columns

Ensure Table Exists

In [4]:
hiveCtx.tableNames()
Out[4]:
[u'sales_sample', u'opportunity_tbl']

Fire Up a Thrift Server

The managed Hive catalog now contains the new dataset. To expose this dataset to self-service BI tools such as Power BI, Tableau, and QlikView, we must start an ODBC/JDBC Thrift server. Before we do, make sure that hive-site.xml (placed on the Spark classpath) points at the metastore, i.e. hive.metastore.uris=thrift://metastoreserveruri:9083
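For reference, a minimal hive-site.xml sketch is shown below; metastoreserveruri is a placeholder for your metastore host, and the file normally sits in Spark's conf/ directory so that it lands on the classpath.

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastoreserveruri:9083</value>
  </property>
</configuration>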

Fire up the Thrift server at a command prompt. Spark ships a start-thriftserver.sh wrapper around the org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 class:

./sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10000
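To confirm the server is accepting connections, the beeline client bundled with Spark can be pointed at the same port (assuming the server runs on localhost):

beeline -u jdbc:hive2://localhost:10000
show tables;
select count(*) from default.sales_sample;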

Connect and Build BI

Now that the Thrift (ODBC/JDBC) server is running on top of the data lake, connect to the lake from BI tools using a Hive/Spark ODBC driver. Client-side setup instructions are included in the PDF below.

In [5]:
from IPython.display import IFrame, Image
IFrame("./SelfServiceBI.pdf", width=1000, height=1100)
Out[5]:
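As a quick programmatic sanity check of the same endpoint outside a BI tool, any Hive-compatible DB-API client can query the lake over Thrift. The sketch below assumes PyHive is installed and the Thrift server is listening on localhost:10000; the column it groups by comes from the preview shown earlier.

import pandas as pd
from pyhive import hive

# Connect to the Spark Thrift server that fronts the Hive warehouse
conn = hive.Connection(host='localhost', port=10000, database='default')

# Aggregate over the managed table created earlier
pd.read_sql(
    'SELECT Product_Container, COUNT(*) AS orders '
    'FROM sales_sample GROUP BY Product_Container',
    conn)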

Build Rich Interactive Visual BI

In [6]:
Image('SelfServiceBI.PNG')
Out[6]:

Conclusion

Everyone should be able to build rich, interactive BI solutions without waiting on IT. The walkthrough above shows how easy that can be.

BI Appliance

There is an additional opportunity with the big data lake shown above. Take the case of a real corporate travel management company (TMC). Besides excellence in travel servicing and duty of care, the TMC prides itself on providing advanced data analytics and insights to corporate travel managers. Yet one piece of feedback is constant --

Give me my insights quickly/realtime

Give me a direct connection to the master lake so my travel-scientists have transparency to the raw data

Give me the freedom to choose my BI solution; do not jail me into your solution

From the TMC's perspective, opening firewalls into the master lake is fraught with security, access control, identity management, performance, policy, and quality challenges. Instead of opening conduits to a master lake that contains sensitive information from multiple clients, here is an innovative solution.

Create micro-replicas of the master lake (primed with client-specific data) as big data Docker appliances. Enable ETL (Pig, Spark), warehousing (Hive, Shark), and BI (Zeppelin, ODBC/JDBC Thrift) on the appliance. Deliver the Docker image as a self-service BI appliance to remove cross-client risk and empower concurrent, day-to-day self-service exploration by clients. A sketch of how such an appliance might be launched follows.
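Purely as an illustration -- the image name, ports, and host path below are hypothetical -- such a client-specific appliance could be launched with a single docker run, publishing the Zeppelin UI and the Thrift endpoint while mounting only that client's slice of the data:

# Hypothetical image, ports, and volume: 8080 = Zeppelin UI, 10000 = ODBC/JDBC Thrift
docker run -d --name client-bi-appliance \
  -p 8080:8080 -p 10000:10000 \
  -v /data/client-extract:/warehouse \
  tmc/bi-appliance:latest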