Saturday, December 16, 2017

IBM and Hortonworks Consolidate Offerings at DataWorks Summit

Several announcements came out of DataWorks Summit this year. One in particular further consolidates the Hadoop distributions and makes Hortonworks Data Platform (HDP) an even more compelling offering.

https://hortonworks.com/press-releases/ibm-hortonworks-expand-partnership/

IBM and Hortonworks are both members of the ODPi, and now they are offering IBM Data Science Experience and IBM Big SQL as packaged offerings with HDP.

In addition, IBM is migrating BigInsights customers to HDP, consolidating IBM BigIntegrate, IBM BigQuality, and IBM Information Governance Catalog into Apache Atlas, and continuing to contribute to open source platforms including Apache Spark and SystemML.

IBM has at least 4 official Apache Spark committers, and Hortonworks has 2. When I looked at this list in April 2014, neither company had any committers. The list of committers has almost doubled since then. Mridul Muralidharan joined Hortonworks from Yahoo!, Nick Pentreath joined IBM from Mxit, and Prashant Sharma joined IBM from Databricks.

IBM, Databricks, and Hortonworks are by far the top contributing companies to PySpark 2.0. Two years ago IBM went all-in on Spark, calling it "Potentially the Most Significant Open Source Project of the Next Decade".

Another announcement was the inclusion of Hortonworks Registry for Kafka, Storm, and NiFi. Similar to https://github.com/confluentinc/schema-registry, it distinguishes itself from the competition by providing pluggable storage of schemas in MySQL or Postgres, a web-based UI, and search capabilities.
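To make "pluggable storage" concrete, here is a minimal sketch of the idea in Python. It is not the Hortonworks Registry API: the names SchemaStore, InMemoryStore, SqliteStore, and SchemaRegistry are all hypothetical, and SQLite stands in for MySQL or Postgres only because it ships with Python.

```python
import sqlite3
from typing import Optional, Protocol


class SchemaStore(Protocol):
    """Hypothetical storage interface; any backend with these methods plugs in."""

    def put(self, name: str, version: int, body: str) -> None: ...
    def get(self, name: str, version: int) -> Optional[str]: ...
    def latest_version(self, name: str) -> int: ...


class InMemoryStore:
    """Keeps schemas in a dict keyed by (name, version)."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, name: str, version: int, body: str) -> None:
        self._data[(name, version)] = body

    def get(self, name: str, version: int) -> Optional[str]:
        return self._data.get((name, version))

    def latest_version(self, name: str) -> int:
        return max((v for (n, v) in self._data if n == name), default=0)


class SqliteStore:
    """Keeps schemas in a SQL table (a stand-in for MySQL or Postgres)."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS schema_versions ("
            "name TEXT, version INTEGER, body TEXT, "
            "PRIMARY KEY (name, version))"
        )

    def put(self, name: str, version: int, body: str) -> None:
        self._db.execute(
            "INSERT INTO schema_versions VALUES (?, ?, ?)", (name, version, body)
        )
        self._db.commit()

    def get(self, name: str, version: int) -> Optional[str]:
        row = self._db.execute(
            "SELECT body FROM schema_versions WHERE name = ? AND version = ?",
            (name, version),
        ).fetchone()
        return row[0] if row else None

    def latest_version(self, name: str) -> int:
        row = self._db.execute(
            "SELECT MAX(version) FROM schema_versions WHERE name = ?", (name,)
        ).fetchone()
        return row[0] or 0  # MAX() over no rows yields NULL


class SchemaRegistry:
    """Assigns increasing version numbers; knows nothing about the backend."""

    def __init__(self, store: SchemaStore) -> None:
        self._store = store

    def register(self, name: str, body: str) -> int:
        version = self._store.latest_version(name) + 1
        self._store.put(name, version, body)
        return version

    def fetch(self, name: str, version: Optional[int] = None) -> Optional[str]:
        if version is None:
            version = self._store.latest_version(name)
        return self._store.get(name, version)


# The same registry code runs against either backend.
results = []
for store in (InMemoryStore(), SqliteStore()):
    registry = SchemaRegistry(store)
    v1 = registry.register("clicks", '{"fields": ["user"]}')
    v2 = registry.register("clicks", '{"fields": ["user", "url"]}')
    results.append((v1, v2, registry.fetch("clicks"), registry.fetch("clicks", 1)))
```

The point of the design is that the registry only talks to the small storage interface, so swapping the database is a one-line change at construction time.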

The question that popped into my head right away: why didn't they just extend the Hive metastore to become the Schema Registry for all things streaming, and provide tumbling windows on Kafka and Storm from Hive? This would have been an awesome addition to the Hive StorageHandlers.
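For anyone unfamiliar with the term, a tumbling window is a fixed-size, non-overlapping time bucket, so every event lands in exactly one window. A minimal sketch in plain Python follows; it is not Hive, Kafka, or Storm code, and the function name tumbling_windows and the sample events are made up for illustration.

```python
from collections import defaultdict


def tumbling_windows(events, width_seconds):
    """Group (timestamp, value) pairs into fixed, non-overlapping windows.

    Returns a dict mapping each window's start time to the values in it.
    """
    windows = defaultdict(list)
    for ts, value in events:
        # Integer division snaps the timestamp down to its window start.
        start = (ts // width_seconds) * width_seconds
        windows[start].append(value)
    return dict(sorted(windows.items()))


# Timestamps are seconds since some arbitrary origin.
events = [(0, "a"), (4, "b"), (10, "c"), (19, "d"), (20, "e")]
result = tumbling_windows(events, 10)
# result == {0: ["a", "b"], 10: ["c", "d"], 20: ["e"]}
```

A real streaming engine would also have to deal with late and out-of-order events, which this sketch ignores.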

There's always HiveKa if anyone wants to pick it up...

The latest HDF 3.0 was announced. One component that brought some excitement was the generically named Streaming Analytics Manager. Its GUI-based design is a bit similar to NiFi, with the addition of dashboards, the aforementioned Schema Registry, and monitoring views. This tool tries to democratize the creation and management of streaming data sources.

Data in motion is the story of 2017 and beyond.


Spark Classes and Resources

There's a lot of material available for Spark MLlib (the RDD-based API), though this API may be deprecated with the next release, i.e. 2.3.
https://cognitiveclass.ai/courses/spark-mllib/

Spark ML is the DataFrame-based API. There are fewer training resources for it than for core Spark; look for MOOCs on edX, DataCamp, and Udemy.

Spark ML training from Strata (full videos are available on safaribooksonline.com), plus a few more titles on Safari from various authors and publishers.

Great resource for anything Spark
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-mllib/spark-mllib-pipelines.html

https://mapr.com/training/certification/mcsd/

Topic-centric list of high-quality open datasets
https://github.com/caesar0301/awesome-public-datasets

Subscribe to the Spark email list or review the archives.

http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark+ml&days=0&sort=date
https://spark.apache.org/community.html

Databricks was founded by the creators of Spark and is its largest contributor.
https://databricks.com/training/courses/apache-spark-for-machine-learning-and-data-science

UC Berkeley, Hortonworks, IBM, and Cloudera are other top Spark contributors.

Berkeley has some courses and is home to MLbase, the granddaddy of MLlib.
http://mlbase.org/

Hortonworks
https://hortonworks.com/apache/spark/

IBM
https://www.ibm.com/ca-en/marketplace/spark-as-a-service

Cloudera
https://university.cloudera.com/instructor-led-training/introduction-to-machine-learning-with-spark-ml-and-mllib (paid)

Deep Learning
https://github.com/databricks/spark-deep-learning

Databricks repos
https://github.com/databricks

Spark Roadmap
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-ml-roadmap-2-3-0-and-beyond-td22892.html#a22972


Certifications search on GitHub
https://github.com/search?l=Markdown&q=spark+ml+certification&type=Code&utf8=%E2%9C%93

Apache Spark Meetups
https://spark.apache.org/community.html