Following the Elephant: Performance and LLAP in Hive

Saturday, March 26, 2016

Hive 2.0 introduces LLAP (Live Long and Process). LLAP is part of the Stinger.next initiative, which targets sub-second response times for interactive analytic queries.

The design proposal for this feature is available here:
https://issues.apache.org/jira/secure/attachment/12665704/LLAPdesigndocument.pdf

Interactive query response times are important when business intelligence tools query Hive directly.

When you execute a query in a database engine like SQL Server or Oracle, the first run can be expensive. Once the cache is warmed up, speed can increase dramatically. Performance problems surface most often with poor or non-reusable query execution plans, which force the engine to go to disk and scan tables for every query rather than efficiently reusing plans and data caches. System configuration, indexing strategy, and statistics all contribute to the performance puzzle.

When you run a distributed Hive query using the Tez engine, it may spin up containers in YARN to process data in the cluster. Starting these containers is relatively expensive. Tez does offer an option for container re-use, but that does not cache fragments of results or query access patterns for use across multiple sessions the way SQL Server and other relational database engines do.

Many actions happen in the background, and it doesn't make sense to repeat most of them for every interactive query. JIT optimization, for example, isn't effective unless the Java process sticks around for a while.
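
A rough cost model makes the point concrete. This is only a sketch: the function name and the figures (5 seconds of startup, 0.5 seconds of work, 20 queries) are made up for illustration and are not measurements from Hive, Tez, or YARN.

```python
def total_latency(num_queries, startup_s, work_s, reuse_process):
    """Total seconds to run num_queries one after another.

    startup_s: fixed cost to launch a process (container allocation,
               JVM start, JIT warm-up), paid per launch
    work_s:    cost of the actual query work, paid per query
    reuse_process: if True, the process is launched once and kept alive;
                   if False, every query pays the startup cost again
    """
    launches = 1 if (reuse_process and num_queries > 0) else num_queries
    return launches * startup_s + num_queries * work_s


# Hypothetical figures, chosen only to show the shape of the trade-off.
cold = total_latency(20, startup_s=5.0, work_s=0.5, reuse_process=False)
warm = total_latency(20, startup_s=5.0, work_s=0.5, reuse_process=True)
print(cold, warm)
```

With these assumed numbers the startup cost dominates the cold case, which is the situation a long-running process is meant to avoid.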

LLAP introduces optional daemons (long-running processes) on worker nodes to facilitate improvements to I/O, caching, and query fragment execution. To reduce the complexity of installing the daemons on nodes, Slider can be used to distribute LLAP in the cluster as a long-running YARN application.

LLAP offers parallel execution of query fragments from different queries and sessions.
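
As a loose analogy, the idea of one long-lived worker pool serving fragments from several queries can be sketched as follows. This is a toy in Python, not how the LLAP daemons are implemented; `run_fragment`, `run_all`, and the query and fragment identifiers are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fragment(query_id, fragment_id):
    # Stand-in for real fragment work (scan, filter, partial aggregate).
    return (query_id, fragment_id, "done")


def run_all(fragments, workers=4):
    """Run fragments from any mix of queries on one shared pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_fragment, q, f) for q, f in fragments]
        # Collect results in submission order.
        return [fut.result() for fut in futures]


# Fragments from two different queries interleaved on the same workers.
fragments = [("q1", 0), ("q2", 0), ("q1", 1), ("q2", 1)]
results = run_all(fragments)
```

The point of the sketch is that the pool outlives any single query, so no fragment pays for starting its own process.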

Metadata is cached in memory on-heap, while data is cached in column chunks and stored off-heap. YARN remains responsible for managing and allocating resources.
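
To illustrate what caching data in column chunks means, here is a minimal sketch of a size-bounded cache keyed by (file, chunk index, column). It rests on several assumptions: the class name, the key layout, and the file name are hypothetical, LRU eviction is used only because it is simple and is not a claim about LLAP's eviction policy, and plain Python objects do not model off-heap storage.

```python
from collections import OrderedDict


class ColumnChunkCache:
    """Toy cache of column chunks with a byte budget and LRU eviction."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._chunks = OrderedDict()  # key -> bytes, least recently used first
        self.hits = 0
        self.misses = 0

    def get(self, key, loader):
        """Return the chunk for key, calling loader(key) on a miss."""
        if key in self._chunks:
            self._chunks.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._chunks[key]
        self.misses += 1
        data = loader(key)
        self._put(key, data)
        return data

    def _put(self, key, data):
        if len(data) > self.capacity_bytes:
            return  # larger than the whole budget, so do not cache it
        # Evict least recently used chunks until the new one fits.
        while self.used_bytes + len(data) > self.capacity_bytes:
            _, evicted = self._chunks.popitem(last=False)
            self.used_bytes -= len(evicted)
        self._chunks[key] = data
        self.used_bytes += len(data)


def fake_read(key):
    # Stand-in for reading one column chunk from disk; always 4 bytes.
    return b"x" * 4


cache = ColumnChunkCache(capacity_bytes=8)
cache.get(("sales.orc", 0, "amount"), fake_read)  # miss: loaded and cached
cache.get(("sales.orc", 0, "amount"), fake_read)  # hit: served from memory
cache.get(("sales.orc", 0, "region"), fake_read)  # miss: cache is now full
cache.get(("sales.orc", 1, "amount"), fake_read)  # miss: evicts the oldest chunk
```

Because the key includes the column, a query that touches only `amount` never forces `region` into memory, which is the benefit of caching at column-chunk granularity rather than whole files.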