Search This Blog

Friday, November 6, 2015

Closer look at U-SQL, MIcrosoft's HiveQL

Microsoft U-SQL is the query language used on Azure Data Lake Analytics services.  Based on SCOPE and Cosmos, which has been around since at least 2008, It combines C# type / expressions functionality, schema-on-read, custom processors and reducers into a SQL-like ETL and output language.

Keywords need to be upper case.  The where clause uses C#-style == syntax.  Rows can contain up to 4MB of data per row.

U-SQL supports SQL.MAP<k,v> and SQL.ARRAY(<T>)

U-SQL supports inline C# expressions, UDFs, UDAs to custom aggregate, UDOs to generate process and consume rowsets.

U-DOs are user-defined operators build with Visual Studio.
https://azure.microsoft.com/pt-pt/documentation/articles/data-lake-analytics-u-sql-develop-user-defined-operators/

It will be interesting to see if this language makes it into SQL Server itself.  Extractors and Outputters would be highly useful to replace some of the functionality of SSIS.

I built a similar tool a few years ago for schema-on-read.  It brought CSV files into BLOB columns in SQL Server (read my article on BLOBs on SQL Server Central)  and allowed you to query them by converting to nvarchar(max), applying a schema, and then outputting to a table.

Kind of felt like a data lake at the time.... though it wasn't massively parallel and didn't have any kind of map-reduce job spinning up.  Then MS introduced the filestream object...