Data Bloom

Store Massive Datasets in the Data Lake and Build Self-Service BI

IT, the key custodian of data, maintains databases, data warehouses, data marts, and data lakes to serve enterprise functions like finance, human resources, sales and marketing, supply chain management, product development, etc. Yet businesses often fault IT for being too slow.

"It's my data, let me have it my way!"

Below, we show how Spark and Hive data lakes make self-service BI (even on an individual laptop) a breeze. Anyone can go from data acquisition to rich, interactive self-service business intelligence dashboards in under 10 minutes. No middlemen need to come between business units and their business intelligence solutions. IT must empower business units by delegating BI, not encumbering them.

For this quick demo, we will use a sample sales dataset from the Tableau website. First, initialize the data lake -- specifically the Hive data lake -- for creating and managing persistent tables from Spark.

In [2]:
import findspark
findspark.init()

from pyspark import SparkContext
from pyspark.sql import HiveContext

# Make sure to initialize the HiveContext for persistent managed tables across sessions:
# the session writing the data (this notebook) is different from the ODBC session
# that later serves the BI platform
if 'sc' not in vars():
    sc = SparkContext()
if 'hiveCtx' not in vars():
    hiveCtx = HiveContext(sc)

Ingest the dataset, infer its schema, and store it in the lake. Again, make sure to create it via the Hive context.

In [2]:
# Some preliminary imports
import re
import pandas as pd
import pyspark.sql.functions as F

# URL of the sample sales workbook on the Tableau website
sample_sales_url = '<sample sales .xls URL>'

# Read the Excel file from the web and ingest it, inferring the schema
opportunities = hiveCtx.createDataFrame(pd.read_excel(sample_sales_url))

# Make column names friendly to Hive
opportunities = opportunities.select([
    F.col(col).alias(re.sub('[^0-9a-zA-Z]+', '_', col))
    for col in opportunities.columns
])

# Create a temporary table in the Spark catalog first
opportunities.registerTempTable('opportunity_tbl')

# Purge any prior tables in Hive to refresh with latest data
hiveCtx.sql("drop table if exists default.sales_sample")

# Register the temporary table from Spark as a managed table in the Hive warehouse
hiveCtx.table('opportunity_tbl').write.saveAsTable(
    'default.sales_sample', mode='overwrite')
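The column-sanitizing regex above can be exercised on its own; a quick standalone check (plain Python, no Spark needed):

```python
import re

def hive_friendly(name):
    """Replace any run of non-alphanumeric characters with a single underscore."""
    return re.sub('[^0-9a-zA-Z]+', '_', name)

# Column names as they might appear in the Excel sheet
raw_columns = ['Row ID', 'Order Date', 'Product Base Margin']
print([hive_friendly(c) for c in raw_columns])
# → ['Row_ID', 'Order_Date', 'Product_Base_Margin']
```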
In [3]:
# Show a preview of the data from the Hive data lake
hiveCtx.table('default.sales_sample').limit(5).toPandas()
Row_ID Order_ID Order_Date Order_Priority ... Product_Name Product_Container Product_Base_Margin Ship_Date
0 1 3 1286928000000000000 Low ... Eldon Base for stackable storage shelf, platinum Large Box 0.80 1287532800000000000
1 49 293 1349049600000000000 High ... 1.7 Cubic Foot Compact "Cube" Office Refrigera... Jumbo Drum 0.58 1349136000000000000
2 50 293 1349049600000000000 High ... Cardinal Slant-D® Ring Binder, Heavy Gauge Vinyl Small Box 0.39 1349222400000000000
3 80 483 1310256000000000000 High ... R380 Small Box 0.58 1310428800000000000
4 85 515 1282953600000000000 Not Specified ... Holmes HEPA Air Purifier Medium Box 0.50 1283126400000000000

5 rows × 21 columns
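Note that Order_Date and Ship_Date come back as epoch timestamps in nanoseconds, an artifact of the pandas-to-Spark round trip. A quick standard-library sanity check decodes the first row's order date:

```python
from datetime import datetime

# Order_Date of the first preview row, in nanoseconds since the Unix epoch
order_date_ns = 1286928000000000000

# Convert nanoseconds to seconds before handing off to datetime
order_date = datetime.utcfromtimestamp(order_date_ns // 10**9)
print(order_date)  # → 2010-10-13 00:00:00
```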

Ensure Table Exists

In [4]:
# Confirm the new table is registered in the Hive catalog
hiveCtx.tableNames('default')
[u'sales_sample', u'opportunity_tbl']

Fire Up a Thrift Server

The managed Hive catalog now contains the new dataset. In order to expose this dataset to self-service BI tools like Power BI, Tableau, QlikView, etc., we must open an ODBC/JDBC Thrift server. Before we do, make sure that hive-site.xml (placed on the Spark classpath) sets hive.metastore.uris=thrift://metastoreserveruri:9083
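For reference, a minimal hive-site.xml along these lines suffices (the metastore host here is a placeholder for your environment):

```xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastoreserveruri:9083</value>
  </property>
</configuration>
```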

Fire up the Thrift server at a command prompt as follows (the start-thriftserver.sh script ships with Spark and launches the HiveThriftServer2 class via spark-submit):

./sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10000

Connect and Build BI

Now that the ODBC server is running on top of the data lake, connect to the lake from BI tools using the ODBC dialect. Client-side setup instructions are included in the PDF below.

In [5]:
from IPython.display import IFrame, Image
IFrame("./SelfServiceBI.pdf", width=1000, height=1100)

Build Rich Interactive Visual BI

In [6]:


Everyone should be able to build rich interactive BI solutions without intervention by IT. The dashboard above shows how easy that is.

BI Appliance

There is an additional benefit to the big data lake shown above. Take the case of a real corporate travel management company (TMC). Besides travel servicing and duty-of-care excellence, the TMC prides itself on providing advanced data analytics and insights to corporate travel managers. Yet one piece of feedback is constant --

"Give me my insights quickly, in real time."

"Give me a direct connection to the master lake so my travel scientists have transparency into the raw data."

"Give me the freedom to choose my BI solution; do not jail me into yours."

From the TMC's perspective, opening firewalls into the master lake is fraught with security, access control, identity management, performance, policy, and quality challenges. Instead of opening conduits to a master lake that contains sensitive information from multiple clients, here is an innovative solution.

Create micro-replicas of the master lake, primed with the client-specific data, as big data Docker appliances. Enable ETL (Pig, Spark), warehousing (Hive, Shark), and BI (Zeppelin, ODBC/JDBC Thrift) on the appliance. Deliver the Docker image as a self-service BI appliance to remove cross-client risk and enable concurrent, day-to-day self-service exploration by clients.
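As a sketch, such an appliance could be packaged along these lines; the base image name and the client extract path are hypothetical placeholders, not a tested build:

```dockerfile
# Hypothetical appliance image: Spark + Hive warehouse + Thrift endpoint,
# primed with one client's slice of the master lake
FROM some-spark-hive-base:latest

# Copy the client-specific extract into the appliance's Hive warehouse directory
COPY client_extract/ /user/hive/warehouse/

# Expose the ODBC/JDBC Thrift port for the client's BI tool of choice
EXPOSE 10000

# Start the Thrift server over the embedded warehouse
CMD ["./sbin/start-thriftserver.sh", "--hiveconf", "hive.server2.thrift.port=10000"]
```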