Search This Blog

Monday, June 9, 2014

Querying Hive, the "Microsoft Way"

Apache Hive is an abstraction tool for generation of MapReduce jobs in Hadoop, and a lightweight data warehousing tool providing schema on read capabilities and storage of metadata in its metastore.  By default, it is stored in MySQL.  In Microsoft Azure HDInsight, it is stored in Azure SQL.

Using a "SQL-like" HiveQL language you can write queries that can access data stored in a Hadoop cluster, either within the Hive warehouse (predefined metadata) or in external files (text or binary).

Microsoft has LINQtoHive support through the Hadoop SDK, for those developers who enjoy using LINQ as an abstraction to their data.

Go get LinqPad and try it out!

If you're lucky enough to already have the LinqPad Premium edition, you can do a NuGet on the assembly required directly from the Query Properties pane.

You'll need the following assemblies for this demo query.  For testing purposes, I just installed them using Nuget in Visual Studio, then browsed to the folder containing the assembly in LinqPad.

install-package Microsoft.Hadoop.Hive 
Install-Package Microsoft.AspNet.WebApi.Client -Version 4.0.20710
Install-Package Newtonsoft.Json

Once you've added the assemblies, you can run this C# statement, after replacing the URL, User ID & Password.  The port is the WebHCat port where Hive / HCatalog is available.

var db = new HiveConnection(
            webHCatUri: new Uri("http://<myhadoopclusterurl>:50111"),
            userName: (string) "<myuserid>", password: (string) "<mypassword>");

var result = db.ExecuteHiveQuery("select * from access_logs");
result.Wait();

LinqPad is awesomeness...

No comments:

Post a Comment