Monday, June 23, 2014

Presentations from the Apache Accumulo Summit 2014

"Up to 10 quadrillion entries in a single table"

That's 10,000,000,000,000,000 entries (key/value pairs).

Sounds like a limitation to me...

Presentations from the Accumulo Summit. Accumulo is an Apache implementation of Google's Bigtable design.  http://www.slideshare.net/AccumuloSummit

Information on HAWQ and the Accumulo connector, Ambari, Slider, YARN, TinkerPop, and more.

The TinkerPop stack with Blueprints is my favourite project suite to read about, if only because of the cartoon mascots in its architecture diagrams.  Every project team needs a graphic designer like Ketrina Yim to improve morale and adoption in the community.  Many projects would benefit from a designer's perspective, rather than a programmer's, when it comes to building user-friendly applications and branding.  Tech projects often take themselves much too seriously.

Would you rather learn more about Graph Server XI or Rexster?  I thought so...

- Ketrina Yim, TinkerPop stack

Come for the information, stay for the nice graphs on Accumulo adoption in the community and the 172-slide deck from Aaron Cordova on scaling Accumulo clusters, with lots of examples of truly "Big Data":

  • 1 year of the Large Hadron Collider = 15 PB
  • 1 year of Twitter = 182 billion tweets & 483 TB
  • Netflix master = 3.14 PB (Pie!)
  • WoW = 1.3 PB
  • Internet Archive = 15 PB

That's not big data. THIS is big data...



Friday, June 20, 2014

Hadoop'able Materialized Views

The smart teams working on Apache Optiq are promoting discardable, in-memory materialized queries (DMMQs) as a potential source of performance improvements when dealing with large distributed datasets in Hadoop.  Why not use all that memory sitting in your Hadoop cluster?

A presentation on DMMQ here.
http://www.slideshare.net/julianhyde/discardable-inmemory-materialized-queries-with-hadoop

The DMMQ (Discardable, In-memory Materialized Query) blog at Hortonworks
http://hortonworks.com/blog/dmmq/

The DDM (Discardable Distributed Memory) blog at Hortonworks
http://hortonworks.com/blog/ddm/
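
Roughly, the idea translates into HiveQL terms like this (a loose sketch; clicks and clicks_by_day are made-up names, and the real DMMQ machinery lives in the Optiq optimizer rather than in DDL like this):

-- Hypothetical pre-aggregated summary of a large fact table. In the DMMQ
-- model this result set lives in cluster memory and can be discarded and
-- rebuilt at any time.
CREATE TABLE clicks_by_day AS
SELECT click_date, COUNT(*) AS total_clicks
FROM clicks
GROUP BY click_date;

-- Users keep querying the base table; a materialized-query-aware optimizer
-- can rewrite this to read the much smaller clicks_by_day instead.
SELECT click_date, COUNT(*)
FROM clicks
WHERE click_date >= '2014-06-01'
GROUP BY click_date;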

Monday, June 9, 2014

Querying Hive, the "Microsoft Way"

Apache Hive is an abstraction layer that generates MapReduce jobs in Hadoop, and a lightweight data warehousing tool that provides schema-on-read capabilities and stores its metadata in the metastore.  Out of the box the metastore uses an embedded Derby database, with MySQL being the usual choice in real deployments; in Microsoft Azure HDInsight the metastore lives in Azure SQL Database.

Using a "SQL-like" HiveQL language you can write queries that can access data stored in a Hadoop cluster, either within the Hive warehouse (predefined metadata) or in external files (text or binary).

Microsoft offers LINQ to Hive support through its .NET SDK for Hadoop, for developers who enjoy using LINQ as an abstraction over their data.

Go get LINQPad and try it out!

If you're lucky enough to already have LINQPad Premium, you can pull in the required NuGet packages directly from the Query Properties pane.

You'll need the following packages for this demo query.  For testing purposes, I just installed them using NuGet in Visual Studio, then browsed to the folders containing the assemblies from LINQPad.

Install-Package Microsoft.Hadoop.Hive
Install-Package Microsoft.AspNet.WebApi.Client -Version 4.0.20710
Install-Package Newtonsoft.Json

Once you've added the assemblies, you can run this C# snippet after replacing the URL, user ID, and password.  The port is the WebHCat port where Hive/HCatalog is exposed (50111 by default).

using System;
using Microsoft.Hadoop.Hive;

// Connect to Hive through the cluster's WebHCat (Templeton) endpoint.
var db = new HiveConnection(
            webHCatUri: new Uri("http://<myhadoopclusterurl>:50111"),
            userName: "<myuserid>", password: "<mypassword>");

// Submit the HiveQL query and block until the job has finished.
var result = db.ExecuteHiveQuery("select * from access_logs");
result.Wait();

LINQPad is awesomeness...

Hadoop Summit 2014 presentations

Slides and presentations from the Hadoop Summit 2014 in San Jose are here.

To me, the most fascinating was Hadoop 2 @ Twitter, Elephant Scale, and the sheer size of the data being worked on during the migration.

Sunday, May 18, 2014

Buzz about Hive - ACID Support and Query Optimization

Apache Hive™ is a distributed data warehousing solution for Hadoop that includes the HiveQL language for a SQL-like experience.  It's not SQL, and it's not an Oracle, SQL Server, or Teradata warehouse.

The Hadoop Platforms team at Yahoo! has announced it is backing Hive (which originally came out of Facebook) and, in particular, the upcoming features of full ACID support and cost-based query optimization. Features like these would bring Hive closer to the world of relational databases, while keeping the benefits of a large-scale distributed data store capable of holding structured, semi-structured, and unstructured data.
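
To make that concrete, here is a rough sketch of the kind of statements the ACID work is aiming for.  The table requirements and syntax (ORC storage, bucketing, a transactional table property) are my guess at the eventual design, not something current Hive releases can run:

-- Hypothetical transactional table; row-level UPDATE and DELETE are the
-- operations the ACID work targets.
CREATE TABLE customer (
  id    INT,
  name  STRING,
  email STRING
)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

UPDATE customer SET email = 'new@example.com' WHERE id = 42;
DELETE FROM customer WHERE id = 99;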

It's fascinating to see the things that we take for granted in relational databases being built from the ground up in other tools such as Hive, with the community discussing problems that were resolved in many other database platforms 20+ years ago. 

If you want to compare SQL Server or Oracle to Hive, it's probably best not to. Hive doesn't include many of the features found in the more mature database platforms.  Queries can take a while, even (and especially) on tiny datasets, because it is designed for batch processing. What it has going for it is cheap storage and distribution of workload.

DB-Engines currently has Hive at #18 in its database popularity ranking.  That ranking should be taken with a grain of salt, as the list compares apples to oranges.  Would you rank Microsoft Access and Oracle on the same page?  They are different systems with different purposes, audiences, and scalability characteristics.

Ignoring all that, Hive would rank #12 if classified only against the relational databases. That puts it ahead of SAP HANA and dBase.  Did I just say dBase?  Yes I did.  You can buy it for DOS, with the original 1994 documentation, for $99.  And a DOS emulator to run it on.  I may have a copy in my basement I can sell you too... along with my Intellivision.

The DB-Engines site classifies Hive as a relational database, which it is not.  A relational database defines a primary key within a table and foreign keys in related tables that refer back to that primary key.  Hive currently has no concept of primary keys, foreign keys, or relationships, which makes me a bit nervous about the manageability of data.

Something that we DBAs take for granted, such as Oracle sequences or SQL Server identity columns, simply doesn't exist in Hive.
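
For comparison, here is what that one-liner looks like in SQL Server, next to the kind of workaround Hive pushes you toward today (the table names are made up, and the workaround only really holds for a single batch load):

-- SQL Server: the engine hands out surrogate keys for you.
CREATE TABLE customer (
  id   INT IDENTITY(1,1) PRIMARY KEY,
  name VARCHAR(100)
);

-- Hive: no sequences or identity columns, so you generate the numbers
-- yourself at load time, for example with a windowing function.
INSERT INTO TABLE customer_hive
SELECT ROW_NUMBER() OVER (ORDER BY name) AS id, name
FROM staging_customers;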

It's only a matter of time though...

Here are some of the JIRAs related to ACID support and cost-based query optimization.

Cost-based Query Optimizations
https://issues.apache.org/jira/browse/HIVE-1938

Relationships & Sequences
https://issues.apache.org/jira/browse/HIVE-6905

ACID Updates
https://issues.apache.org/jira/browse/HIVE-5317

In other news, Qubole, founded by some of the members of the original Facebook Hive team, has announced Presto as a service: another SQL-on-Hadoop engine that runs 10x faster than Hive, or at least that's the quoted marketing metric.

Thursday, May 15, 2014

115 Projects and Mountains of Data

Have you heard of Hadoop?  Sure you have.  You're reading this, aren't you?

My colleague, who is going for every Hadoop certification available, has kindly provided some links to add to my 150 MB OneNote notebook on the Hadoop and Apache ecosystems.  That's a bit more than one HDFS block (roughly 3.5 blocks of storage once you count the default 3x replication) if you haven't changed the default 128 MB block size in Hadoop 2.  I'll try to share some of them on this site.


The list of projects out there doesn't quite qualify as big data but is still getting pretty unmanageable for me.  Apache alone has 115 projects listed, though some are shelved and haven't been updated in a while, and only about 11 are categorized as "Big Data."


I'm pursuing one certification for now, and focusing a bit more on some of the amazing tools out there that work with the core infrastructure.  I'll share some of my findings on this blog for anyone who might find them helpful.


If you're going to get certified in the core of Hadoop, you'll want to understand Java programming and MapReduce theory. This could change in the future, as MapReduce slowly gets relegated to the mines of Mordor, with YARN treating it as just one tenant in a larger domain of heterogeneous applications.  The possibility of running different MR versions, or even doing away with MR and building on one of the other seven dwarves (or perhaps thirteen) as a core piece of the architecture, is a real consideration.


Speaking of Mordor, an Oliphaunt is a large war elephant from Lord of the Rings.  



The New York Times has an article from 1984 called "The Mystery of Hannibal's Elephants."  Hannibal had a 38-node cluster of war elephants, and crossed the Alps with those elephants and 100,000 men (give or take 60,000 or so; Wikipedia has a different number).

There are currently 129 Apache committers who contribute to more than 10 projects each.  That's just under 4% of the 3,500 or so committers listed on the Apache site.  The top two, Jim Jagielski and Dr. Chris Mattmann, have contributed to at least 35 different projects each.  The Apache ecosystem is an amazing community with some very dedicated and passionate individuals.  However, there is an even larger "dark pool" of talent branching and forking open-source code for their own needs within the silos of companies like Twitter, Intel, eBay, LinkedIn, IBM, Facebook, Google, Yahoo, and yes, even Microsoft.


The cute elephant in the room of 2006 is turning into a herd of war elephants that will crush relational database systems as we know them.

Or so they say...



I will either find a way, or make one.
-- Hannibal