
Saturday, December 16, 2017

IBM and Hortonworks Consolidate Offerings at DataWorks Summit

At DataWorks Summit this year, a few announcements were made.  One in particular further consolidates the Hadoop distributions and makes Hortonworks Data Platform (HDP) an even more compelling offering.

https://hortonworks.com/press-releases/ibm-hortonworks-expand-partnership/

IBM and Hortonworks are both members of the ODPi, and now they are offering IBM Data Science Experience and IBM Big SQL as packaged offerings with HDP.

In addition, IBM is migrating BigInsights customers to HDP, consolidating IBM BigIntegrate, IBM BigQuality, and IBM Information Governance Catalog into Apache Atlas, and continuing to contribute to open source platforms including Apache Spark and SystemML.

IBM has at least 4 official Apache Spark committers, and Hortonworks has 2.  When I looked at this list in April 2014, neither company had committers.  The list of committers has almost doubled since then.  Mridul Muralidharan joined Hortonworks from Yahoo!, Nick Pentreath joined IBM from Mxit, and Prashant Sharma joined IBM from Databricks.

IBM, Databricks, and Hortonworks are by far the top contributing companies to PySpark 2.0.  Two years ago IBM went all-in on Spark, calling it "Potentially the Most Significant Open Source Project of the Next Decade".

Another announcement was the inclusion of the Hortonworks Schema Registry for Kafka, Storm, and NiFi.  Similar to https://github.com/confluentinc/schema-registry, it distinguishes itself from the competition by providing pluggable storage of schemas in MySQL or Postgres, a web-based UI, and search capabilities.
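To make the comparison concrete, here is a minimal sketch of how a schema registry is typically used, written against the Confluent-style REST API linked above (the Hortonworks Registry exposes its own, different REST API); the registry URL, subject name, and schema are hypothetical:

import json
import requests

# Hypothetical registry endpoint and subject name.
REGISTRY = "http://localhost:8081"
SUBJECT = "truck-events-value"

# An example Avro schema to register.
schema = {
    "type": "record",
    "name": "TruckEvent",
    "fields": [
        {"name": "driver_id", "type": "int"},
        {"name": "event_type", "type": "string"},
    ],
}

# Register a new schema version under the subject.
resp = requests.post(
    REGISTRY + "/subjects/" + SUBJECT + "/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(schema)}),
)
print(resp.json())  # e.g. {'id': 1}

# Fetch the latest registered version back.
print(requests.get(REGISTRY + "/subjects/" + SUBJECT + "/versions/latest").json())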

The question that popped into my head right away is why didn't they just extend the Hive metastore to become the Schema Registry for all things streaming, and provide tumbling windows on Kafka and Storm from Hive?  This would have been an awesome addition to the Hive StorageHandlers.

There's always HiveKa if anyone wants to pick it up...

The latest HDF 3.0 was announced.  One component that brought some excitement was the generically-named Streaming Analytics Manager.  Its GUI-based design is a bit similar to NiFi, with the addition of Dashboards, the aforementioned Schema Registry, and monitoring views.  This tool tries to democratize the creation and management of streaming data sources.

Data in motion is the story of 2017 and beyond.


Spark Classes and Resources

There's a lot of material available for Spark MLlib (the RDD-based API), though this API may be deprecated in the next release, i.e. 2.3.
https://cognitiveclass.ai/courses/spark-mllib/
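For orientation, here is a minimal PySpark sketch of the RDD-based spark.mllib API on toy data (assumes a local Spark installation):

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="mllib-rdd-sketch")

# Toy training data: label followed by two features.
training = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
    LabeledPoint(1.0, [1.5, 0.5]),
    LabeledPoint(0.0, [0.2, 1.2]),
])

model = LogisticRegressionWithLBFGS.train(training)
print(model.predict([1.0, 0.0]))  # expect 1

sc.stop()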

Spark ML is the DataFrame-based API; there are fewer training resources for it than for core Spark, mostly MOOCs on edX/DataCamp/Udemy.
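And the equivalent kind of toy example in the DataFrame-based spark.ml Pipeline API (again just a sketch, assuming a local SparkSession):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("spark-ml-sketch").getOrCreate()

# Toy training data as a DataFrame.
df = spark.createDataFrame(
    [(0.0, 0.0, 1.0), (1.0, 1.0, 0.0), (1.0, 1.5, 0.5), (0.0, 0.2, 1.2)],
    ["label", "f1", "f2"])

# Assemble feature columns into a vector, then fit logistic regression.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[assembler, lr]).fit(df)

model.transform(df).select("label", "prediction").show()

spark.stop()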

There is Spark ML training at Strata (full videos are available on safaribooksonline.com), and a few more courses on Safari from various authors/publications.

Great resource for anything Spark
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-mllib/spark-mllib-pipelines.html

MapR Certified Spark Developer certification
https://mapr.com/training/certification/mcsd/

Topic-centric list of high-quality open datasets
https://github.com/caesar0301/awesome-public-datasets

Subscribe to the Spark user email list or review the archives.

http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark+ml&days=0&sort=date
https://spark.apache.org/community.html

Databricks was founded by the creators of Spark and is its largest contributor.
https://databricks.com/training/courses/apache-spark-for-machine-learning-and-data-science

UC Berkeley, Hortonworks, IBM, and Cloudera are among the other top Spark contributors.

UC Berkeley, the granddaddy of MLlib, also has some courses.
http://mlbase.org/

Hortonworks
https://hortonworks.com/apache/spark/

IBM
https://www.ibm.com/ca-en/marketplace/spark-as-a-service

Cloudera
https://university.cloudera.com/instructor-led-training/introduction-to-machine-learning-with-spark-ml-and-mllib (paid)

Deep Learning
https://github.com/databricks/spark-deep-learning

Databricks repos
https://github.com/databricks

Spark Roadmap
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-ml-roadmap-2-3-0-and-beyond-td22892.html#a22972


Certifications search on Github
https://github.com/search?l=Markdown&q=spark+ml+certification&type=Code&utf8=%E2%9C%93

Apache Spark Meetups
https://spark.apache.org/community.html

Friday, February 24, 2017

Azure Data Lake Analytics

Microsoft Azure Data Lake Analytics and Data Lake Store offerings provide an alternative and complementary solution to Azure HDInsight & Hortonworks HDP.

Azure Data Lake Analytics (ADLA) provides the U-SQL language (think Pig + SQL + C# + more) based on Microsoft's internal language Scope.  Scope is used for tools like Bing Search.  It has the same concepts as Hadoop - schema on read, custom reducers, extractors/SerDes, etc.  A component of ADLA is based on Microsoft's internal job scheduler and compute engine, Cosmos.  ADLA uses Apache YARN to schedule jobs and manage its in-memory components.

Azure Data Lake Store (ADLS) is a blob storage layer for ADLA, which behaves more like HDFS and uses WebHDFS / Apache Hadoop APIs behind the scenes.  ADLA includes the concepts of Tables, Views, Stored Procedures, Table-Valued Functions, and Partitions, and stores these types of objects in its internal metastore catalog, similar to Hive.

Currently U-SQL supports the TSV/CSV format out of the box, with extensions for JSON and the ability to write custom extractors against pretty much any format that you could read with .NET or the .NET SDK for Hadoop.

A U-SQL script looks something like this:

DECLARE EXTERNAL @inputfile string = "myinputdir/myinputfile";
DECLARE EXTERNAL @outputlocation string = "myoutputdir/myoutputfile";

@indataset =
    EXTRACT col1 string,
            col2 int?
    FROM @inputfile
    USING Extractors.Tsv(skipFirstNRows:1, silent:false);

// Default a missing col2 to 0.
@outdataset =
    SELECT col1,
           (col2 == null) ? 0 : col2.Value AS col2filled
    FROM @indataset;

OUTPUT @outdataset
TO @outputlocation
USING Outputters.Tsv(outputHeader:true, quoting:false);

One problem I have with U-SQL is the name.  Every search on Google comes back with "We searched for SQL. Did you mean USQL?"

U-SQL uses C# syntax and .NET data typing, and it supports code-behind files and custom assemblies.
A U-SQL script job can be submitted either locally for testing or to Azure Data Lake Analytics.  It is a batch process and there is limited interactive functionality.
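For example, here is a hedged sketch of a programmatic submission using the azure-mgmt-datalake-analytics Python SDK; the account name, credentials, and script path are placeholders, and the SDK surface may differ by version:

import uuid

from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.datalake.analytics.job import DataLakeAnalyticsJobManagementClient
from azure.mgmt.datalake.analytics.job.models import JobInformation, USqlJobProperties

# Placeholder Azure AD service principal credentials and ADLA account name.
creds = ServicePrincipalCredentials(client_id="<app-id>", secret="<secret>", tenant="<tenant>")
job_client = DataLakeAnalyticsJobManagementClient(creds, "azuredatalakeanalytics.net")

script = open("myscript.usql").read()
job_id = str(uuid.uuid4())

# Submit the U-SQL script as a batch job, then check its state.
job_client.job.create(
    "myadlaaccount", job_id,
    JobInformation(name="usql-demo", type="USql",
                   properties=USqlJobProperties(script=script)))
print(job_client.job.get("myadlaaccount", job_id).state)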

For those familiar with using hdfs / hadoop commands, there is Python shell development in progress against ADLS with some familiar commands (a Python sketch follows the list below).

cat    chmod  close  du      get   help  ls     mv   quit  rmdir  touch
chgrp  chown  df     exists  head  info  mkdir  put  rm    tail
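
Similar operations are already available from Python today via the azure-datalake-store package; a hedged sketch (store name and Azure AD credentials are placeholders):

from azure.datalake.store import core, lib, multithread

# Placeholder Azure AD service principal credentials and store name.
token = lib.auth(tenant_id="<tenant>", client_id="<app-id>", client_secret="<secret>")
adls = core.AzureDLFileSystem(token, store_name="mystore")

print(adls.ls("/"))                        # like 'hdfs dfs -ls /'
adls.mkdir("/myinputdir")                  # mkdir
multithread.ADLUploader(adls, lpath="myinputfile",
                        rpath="/myinputdir/myinputfile")    # put
print(adls.cat("/myinputdir/myinputfile"))                  # cat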

As with other Azure services, you can also use the Azure cross-platform CLI (xplat CLI), PowerShell & Web APIs.