Friday, February 24, 2017

Azure Data Lake Analytics

Microsoft's Azure Data Lake Analytics and Data Lake Store offerings provide an alternative and complementary solution to Azure HDInsight and Hortonworks HDP.

Azure Data Lake Analytics (ADLA) provides the U-SQL language (think Pig + SQL + C# + more), which is based on SCOPE, a language Microsoft uses internally for services like Bing Search. U-SQL shares many concepts with Hadoop: schema on read, custom reducers, extractors/SerDes, etc. Part of ADLA is based on Cosmos, Microsoft's internal job scheduler and compute engine. ADLA uses Apache YARN to schedule jobs and manage its in-memory components.

Azure Data Lake Store (ADLS) is a blob storage layer for ADLA that behaves more like HDFS and uses WebHDFS / Apache Hadoop behind the scenes. ADLA includes the concepts of tables, views, stored procedures, table-valued functions, and partitions, and stores these objects in its internal metastore catalog, similar to Hive.

Currently ADLA's built-in extractors support the TSV and CSV formats, with extensions for JSON and the ability to write custom extractors against pretty much any format that you could read with .NET or the .NET SDK for Hadoop.

A U-SQL script looks something like this:

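// Paths are placeholders; DECLARE EXTERNAL lets the caller override them at submit time.
DECLARE EXTERNAL @inputfile string = "myinputdir/myinputfile";
DECLARE EXTERNAL @outputlocation string = "myoutputdir/myoutputfile";

// Schema on read: column names and .NET types are applied while extracting.
@indataset =
    EXTRACT col1 string,
            col2 int?
    FROM @inputfile
    USING Extractors.Tsv(skipFirstNRows:1, silent:false);

// col2 is a nullable int, so a blank value arrives as null; replace it with 0.
@outdataset =
    SELECT col1,
           col2 ?? 0 AS isblankcol
    FROM @indataset;

OUTPUT @outdataset TO @outputlocation
USING Outputters.Tsv(outputHeader:true, quoting:false);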
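
For readers who don't know U-SQL yet, here is a rough Python sketch of what the script above is meant to do: skip the header row, apply the two-column schema while reading, turn a blank col2 into 0, and write a TSV with a header. It only illustrates the logic and has nothing to do with how ADLA actually runs a job. The function names and the sample data are made up for this example, and I'm assuming a blank field in a nullable int column is read as null.

```python
import csv
import io
from typing import Iterable, Iterator, Optional, Tuple


def extract_tsv(text: str, skip_first_n_rows: int = 1) -> Iterator[Tuple[str, Optional[int]]]:
    """Apply the (col1 string, col2 int?) schema while reading, like EXTRACT."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for index, row in enumerate(reader):
        if index < skip_first_n_rows or not row:
            continue  # skip header rows (skipFirstNRows:1) and empty lines
        col1, raw_col2 = row
        # Assumption: a blank field maps to None, standing in for a null int?.
        yield col1, (int(raw_col2) if raw_col2 != "" else None)


def select_rows(rows: Iterable[Tuple[str, Optional[int]]]) -> Iterator[Tuple[str, int]]:
    """Mirror the SELECT: keep col1 and replace a null col2 with 0."""
    for col1, col2 in rows:
        yield col1, (0 if col2 is None else col2)


def output_tsv(rows: Iterable[Tuple[str, int]]) -> str:
    """Write a header plus rows as unquoted TSV, like Outputters.Tsv."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", quoting=csv.QUOTE_NONE,
                        lineterminator="\n")
    writer.writerow(("col1", "isblankcol"))
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample input: one header row, one full row, one row with a blank col2.
sample = "name\tcount\nalpha\t3\nbeta\t\n"
result = output_tsv(select_rows(extract_tsv(sample)))
print(result)
```

The three functions line up with the EXTRACT, SELECT and OUTPUT statements, which is about as close as plain Python gets to the rowset-at-a-time style of U-SQL.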

One problem I have with U-SQL is the name. Every search on Google comes back with "We searched for SQL. Did you mean USQL?"

U-SQL uses C# syntax and .NET data types, and it supports code-behind files and custom assemblies.
A U-SQL script job can be submitted either locally for testing or to Azure Data Lake Analytics. It runs as a batch process, and interactive functionality is limited.

For those familiar with hdfs / hadoop commands, a Python shell for ADLS is in development and offers some familiar commands:

cat    chmod  close  du      get   help  ls     mv   quit  rmdir  touch
chgrp  chown  df     exists  head  info  mkdir  put  rm    tail

As with any Azure service, you can also use the Azure Xplat CLI, PowerShell and the Web APIs.
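
Since ADLS speaks WebHDFS, a Web API call is mostly a matter of building the right URL. The sketch below shows the endpoint pattern as I understand it, https://<account>.azuredatalakestore.net/webhdfs/v1/<path>?op=<OPERATION>. The account name "myadlsaccount" and the helper function are hypothetical, and a real request also needs an Azure AD bearer token in the Authorization header, which is not shown here.

```python
from urllib.parse import quote, urlencode


def adls_webhdfs_url(account: str, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS-style URL for an ADLS account (illustrative helper only)."""
    base = f"https://{account}.azuredatalakestore.net/webhdfs/v1/"
    query = urlencode({"op": op, **params})
    # quote() keeps "/" as-is and percent-encodes characters such as spaces.
    return f"{base}{quote(path.lstrip('/'))}?{query}"


# Hypothetical account and path; GETFILESTATUS is a standard WebHDFS operation.
url = adls_webhdfs_url("myadlsaccount", "/myinputdir/myinputfile", "GETFILESTATUS")
print(url)
```

The same helper covers other standard WebHDFS operations such as LISTSTATUS or MKDIRS by changing the op argument.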