Sunday, May 18, 2014

Buzz about Hive - ACID Support and Query Optimization

Apache Hive™ is a distributed data warehousing solution for Hadoop that includes HiveQL, a query language that offers a SQL-like experience. It's not SQL, nor is it an Oracle, SQL Server or Teradata warehouse.

The Hadoop Platforms Team at Yahoo! has announced that it is backing Hive (which originated at Facebook), along with two features in particular: full ACID support and cost-based query optimization. Features like these would bring Hive closer to the world of relational databases, with all the benefits of being a large-scale distributed data store capable of holding structured, semi-structured and unstructured data.

It's fascinating to see the things that we take for granted in relational databases being built from the ground up in other tools such as Hive, with the community discussing problems that were resolved in many other database platforms 20+ years ago. 

If you want to compare SQL Server or Oracle to Hive, it's probably best not to. Hive doesn't include many of the features found in the more mature database platforms. Queries are designed for batch processing and can take some time, even (and especially) on tiny datasets. What Hive has going for it is the low cost of storage and the distribution of workload.

DB-Engines Ranking has Hive currently at #18 on its popularity ranking chart for databases.  This ranking should be taken with a grain of salt, as the list compares apples to oranges.  Would you rank Microsoft Access with Oracle on the same page?  They are different systems with different purposes, audiences, and scalability features. 

Ignoring all that, Hive would be ranked #12 if it were ranked only within the Relational Database category. That puts it ahead of SAP HANA and dBase.  Did I just say dBase?  Yes I did.  You can buy it for DOS with the original 1994 documentation for $99.  And a DOS emulator to run it on.  I may have a copy in my basement I can sell you too... along with my Intellivision.

The DB-Engines site classifies Hive as a Relational Database, which it is not.  A relational database defines a primary key for a relationship within a table, and foreign keys in related tables for associating back to said primary key.  Hive currently has no concept of primary keys or relationships, which gets me a bit stressed about manageability of data. 
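For anyone who hasn't worked with keys before, here is a minimal sketch of what that primary key / foreign key relationship buys you. It uses SQLite through Python's built-in `sqlite3` module purely as a stand-in for a relational database; it is not Hive, and the `customers` and `orders` tables are hypothetical names made up for illustration.

```python
import sqlite3


def orphan_insert_is_rejected() -> bool:
    """Return True if the database refuses a child row with no matching parent."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")

    # Parent table: customer_id is the primary key.
    conn.execute(
        "CREATE TABLE customers ("
        " customer_id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL)"
    )
    # Child table: customer_id is a foreign key pointing back at customers.
    conn.execute(
        "CREATE TABLE orders ("
        " order_id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customers (customer_id))"
    )

    conn.execute("INSERT INTO customers (customer_id, name) VALUES (1, 'Ada')")
    # This order is fine: customer 1 exists.
    conn.execute("INSERT INTO orders (order_id, customer_id) VALUES (100, 1)")

    try:
        # This order points at customer 999, which does not exist.
        conn.execute("INSERT INTO orders (order_id, customer_id) VALUES (101, 999)")
    except sqlite3.IntegrityError:
        return True
    finally:
        conn.close()
    return False
```

In this sketch the database itself rejects the orphaned order. Without keys, that kind of check has to live in the load process or in the heads of the people running it, which is the manageability worry.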

Things that we DBAs take for granted, such as Oracle sequences or SQL Server identity columns, don't exist in Hive.
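As a rough illustration of what an auto-generated key looks like from the application side, here is a small sketch. Again it uses SQLite via Python's `sqlite3` module as a stand-in (not Oracle, SQL Server or Hive), with SQLite's `AUTOINCREMENT` playing the role of a sequence or identity column. The `customers` table and the function name are hypothetical.

```python
import sqlite3


def insert_with_generated_ids(names):
    """Insert each name and return the ids the database assigned."""
    conn = sqlite3.connect(":memory:")
    # The database, not the caller, hands out customer_id values.
    conn.execute(
        "CREATE TABLE customers ("
        " customer_id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " name TEXT NOT NULL)"
    )
    ids = []
    for name in names:
        # Note that no customer_id is supplied in the INSERT.
        cur = conn.execute("INSERT INTO customers (name) VALUES (?)", (name,))
        ids.append(cur.lastrowid)
    conn.close()
    return ids
```

Calling `insert_with_generated_ids(["Ada", "Grace", "Edgar"])` on the fresh table returns `[1, 2, 3]`. The point is only that the caller never has to work out the next key value.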

It's only a matter of time though...

Here are some of the JIRAs related to ACID support and cost-based query optimization.

Cost-based Query Optimizations
https://issues.apache.org/jira/browse/HIVE-1938

Relationships & Sequences
https://issues.apache.org/jira/browse/HIVE-6905

ACID Updates
https://issues.apache.org/jira/browse/HIVE-5317

In other news, Qubole, founded last year by some of the members of the Facebook Hive team, has announced Presto as a service. Presto is another SQL query engine for Hadoop that operates 10x faster than Hive, or at least that's the quoted marketing metric.
