Adaltas Adaltas

Hive, Calcite and Druid

Hive, Calcite and Druid

Hive, Calcite and Druid

David WORMS

Jul 14, 2016

Categories: Big Data
Tags: Business intelligence; Database; Druid; Hadoop; Hive

Do you like our work......we hire!

Never miss our publications about Open Source, big data and distributed systems, low frequency of one email every two months.

BI/OLAP requires interactive visualization of complex data streams:

Real time bidding events
User activity streams
Voice call logs
Network trafic flows
Firewall events
Application KPIs

Traditionnal solutions

RDBMS (Mysql..): don’t scale, need caching but adhoc queries remain slow
Key/value store (HBase…): quick but takes forever to compute (pre-materialization of data)

Context

Created in 2011, open-sourced in 2012
Built for interactive analytics
+150 contributors as of 2016

Main Features

Column oriented
Batch & real-time ingestion
Scalable
Sub-second response time for time-based slice-and-dice

Persistent storage

Segment files partitionned by date
Segment should be around 1GB
If larger, use smaller partitions
Segment consist of: timestamp column, dimension (eg page, username, city), metrics (for aggregation) and indexes (to speed up the lookup)

Druid + Hive

Indexing complex query result in Druid using Hive
Introduce a SQL interface to Druid
Execute complex operations on Druid data
Efficient OLAP queries in Hive
Demo: based on Hive 2.20, no release plans yet, rely on Druid 0.9.1.1 and Apache Calcite 1.10.0, (master and awaiting 0.9.2)

Druid data source in Hive

How to index data in Hive, 2 ways

Register Druid data sources in Hive when data already in druid
STORED BY 'org.apache.hadoop.hive.druid.DruidSTorageHandler' TBLPROPERTIES ("druid.datasource" = "wikiticker")
Create druid data sources
STORED BY 'org.apache.hadoop.hive.druid.DruidSTorageHandler' TBLPROPERTIES ("druid.datasource" = "wikiticker" AS SELECT __time, page, user, c_added, c_removed FROM src;
“_time” is timestamp, “page, user” are dimensions, “c*” are metrics

Need to partition the data by date granularity.

Query Druid data sources from Hive

Automatic rewriting with Calcite
Use a Druid Input Format to fetch records into the Hive pipeline
Will bypass the broker

Roadmap

Push more computation to Druid (push down optimization)
Parallelize results retrieval (bypass broker)
Denormalize start schema
Huge perspectives for materialized views in Hive with automatic input query rewriting

Share this article

Support Ukrain