IT, the key custodian of enterprise data, maintains databases, data warehouses, data marts, and data lakes to serve functions such as finance, human resources, sales and marketing, supply chain management, and product development. Yet the business frequently faults IT for being too slow:
"It's my data, let me have it my way!"
Below, we show how Spark and Hive data lakes make self-service BI a breeze, even on an individual laptop. Anyone can go from data acquisition to rich, interactive business intelligence dashboards in under 10 minutes, with no middlemen between the business and its business intelligence solutions. IT should empower business units by delegating BI, not encumber them.
For this quick demo, we will use a sample sales dataset from the Tableau website. First, initialize the data lake -- specifically the Hive data lake -- for creating and managing persistent tables from Spark.
import findspark
findspark.init()

from pyspark import SparkContext
from pyspark.sql import HiveContext

# Make sure to initialize the HiveContext for persistent managed tables across multiple sessions.
# The session (this active notebook) used to write is different from the ODBC session later serving the BI platform.
if 'sc' not in vars():
    sc = SparkContext()
if 'hiveCtx' not in vars():
    hiveCtx = HiveContext(sc)
Ingest the dataset, infer its schema, and store it in the lake. Again, make sure to create the table in the Hive context.
# Some preliminary imports
import re

import pandas as pd
import pyspark.sql.functions as F

# Read the Excel file from the web and ingest it
opportunities = hiveCtx.createDataFrame(
    pd.read_excel(
        'https://community.tableau.com/servlet/JiveServlet/downloadBody/1236-102-1-1149/Sample%20-%20Superstore%20Sales%20(Excel).xls'
    ),
    samplingRatio=1.0)

# Make the ingested/inferred column names friendly to Hive
opportunities = opportunities.select([
    F.col(col).alias(re.sub('[^0-9a-zA-Z]+', '_', col))
    for col in opportunities.columns
])

# Create a temporary table in the Spark catalog first
opportunities.createOrReplaceTempView("opportunity_tbl")

# Purge any prior tables in Hive to refresh with the latest data
hiveCtx.sql("drop table if exists default.sales_sample")

# Register the temporary Spark table as a managed table in the Hive warehouse
hiveCtx.table('opportunity_tbl').write.saveAsTable(
    'default.sales_sample', mode='overwrite')
# Show a preview of the data from the Hive data lake
opportunities.limit(5).toPandas()
|   | Row_ID | Order_ID | Order_Date | Order_Priority | ... | Product_Name | Product_Container | Product_Base_Margin | Ship_Date |
| 0 | 1 | 3 | 1286928000000000000 | Low | ... | Eldon Base for stackable storage shelf, platinum | Large Box | 0.80 | 1287532800000000000 |
| 1 | 49 | 293 | 1349049600000000000 | High | ... | 1.7 Cubic Foot Compact "Cube" Office Refrigera... | Jumbo Drum | 0.58 | 1349136000000000000 |
| 2 | 50 | 293 | 1349049600000000000 | High | ... | Cardinal Slant-D® Ring Binder, Heavy Gauge Vinyl | Small Box | 0.39 | 1349222400000000000 |
| 4 | 85 | 515 | 1282953600000000000 | Not Specified | ... | Holmes HEPA Air Purifier | Medium Box | 0.50 | 1283126400000000000 |

5 rows × 21 columns
The managed Hive catalog now contains the new dataset. To expose this dataset to self-service BI tools such as Power BI, Tableau, and QlikView, we must open an ODBC/JDBC Thrift server. Before we do, make sure that the hive-site.xml (placed on the Spark classpath) contains the metastore configuration pointing at the same Hive warehouse used above, so the Thrift server session sees the managed tables we just created.
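For illustration, a minimal hive-site.xml could look like the sketch below. The Derby connection URL and warehouse directory are placeholder assumptions; substitute whatever metastore your Spark installation actually uses.

<configuration>
  <!-- Metastore database; an embedded Derby database is shown only as an example -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/path/to/metastore_db;create=true</value>
  </property>
  <!-- Directory where managed tables (like default.sales_sample) are stored -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/path/to/hive/warehouse</value>
  </property>
</configuration>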
Fire up the Thrift server at a command prompt as follows:
spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --hiveconf hive.server2.thrift.port=10000
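Before pointing a BI tool at the endpoint, you can sanity-check it programmatically. Below is a minimal sketch, assuming the third-party PyHive package is installed (pip install "pyhive[hive]") and the server is listening on localhost:10000 as configured above:

from pyhive import hive

# Connect to the Thrift server over the Hive protocol;
# host and port are assumptions matching the command above
conn = hive.connect(host='localhost', port=10000)
cursor = conn.cursor()

# The managed table registered earlier should be visible...
cursor.execute('SHOW TABLES IN default')
print(cursor.fetchall())

# ...and queryable, just as a BI tool would query it
cursor.execute('SELECT COUNT(*) FROM default.sales_sample')
print(cursor.fetchall())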
Now that the ODBC server is running on top of the data lake, connect to the lake from BI tools using the ODBC dialect. Client-side setup instructions are included in the PDF below.
from IPython.display import IFrame
IFrame("./SelfServiceBI.pdf", width=1000, height=1100)
As the demo above shows, anyone can build rich, interactive BI solutions without intervention by IT.
There is an additional use case for the big data lake shown above. Take the case of a real corporate travel management company (TMC). Besides excellence in travel servicing and duty of care, the TMC prides itself on providing advanced data analytics and insights to corporate travel managers. Yet one piece of client feedback is constant:
Give me my insights quickly/realtime
Give me a direct connection to the master lake so my travel-scientists have transparency to the raw data
Give me the freedom to choose my BI solution; do not jail me into your solution
From the TMC's perspective, opening firewalls into the master lake is fraught with security, access control, identity management, performance, policy, and quality challenges. Instead of opening conduits to a master lake that contains sensitive information from multiple clients, here is an innovative solution:
Create micro-replicas of the master lake, primed with client-specific data, as big data Docker appliances. Enable ETL (Pig, Spark), warehousing (Hive, Shark), and BI (Zeppelin, ODBC/JDBC Thrift) on the appliance. Deliver the Docker image as a self-service BI appliance to remove cross-client risk and empower concurrent, day-to-day self-service exploration by each client; a sketch of the priming step follows below.
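To make the priming step concrete, here is a minimal PySpark sketch that reuses the hiveCtx pattern from the demo above. The master.bookings table and the client_id column are hypothetical placeholders, not an actual TMC schema:

import pyspark.sql.functions as F

# 'hiveCtx' is the HiveContext initialized at the start of the demo;
# 'master.bookings' and 'client_id' are hypothetical placeholder names
client = 'ACME_CORP'

# Slice out only this one client's rows from the master lake
client_slice = hiveCtx.table('master.bookings').where(
    F.col('client_id') == client)

# Write the slice as a managed table in the replica's own Hive warehouse;
# the Docker appliance ships with this warehouse, so no conduit back to
# the multi-client master lake is ever opened
client_slice.write.saveAsTable('default.bookings', mode='overwrite')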