Wednesday, November 2, 2016
What's trending in the world of GitHub and Open Source?
GitHub has a trove of information about its organizations, committers, repos, code and issues.
GitHub Archive maintains per-hour stats on 28 event types hooking into repo activities across the platform. It includes things like a committer's login, URL, organization, followers, gists, starred repos, and a history of all their public coding activity.
The October 2016 archive table is 29M rows and 70GB of data.
In September 2016, Microsoft became the largest "open-source" contributor organization on GitHub, largely due to its custom API integration using Azure Services and a rather elegant management system for its employees and repos. If you can onboard all developers in a company the size of Microsoft, and automate repository setup and discovery, you will quickly become the largest contributor.
Microsoft beat out Docker, Angular, Google, Atom, FortAwesome, Elastic, and even Apache.
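GitHub Archive publishes these events as public BigQuery tables, which is how numbers like the above can be derived. As a rough sketch (the githubarchive.month table naming and the org.login field are assumptions based on the public dataset's documented schema, so verify them before relying on this), a top-organizations query could look like:
-- Count October 2016 events per organization (BigQuery standard SQL).
-- Table name and schema assumed from the public githubarchive dataset.
SELECT org.login AS organization, COUNT(*) AS events
FROM `githubarchive.month.201610`
WHERE org.login IS NOT NULL
GROUP BY organization
ORDER BY events DESC
LIMIT 20;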
Wednesday, June 22, 2016
Getting the best performance with Pyspark
Some good tips on performance with PySpark and DataFrames from Holden Karau at Spark Summit 2016.
With code
https://github.com/high-performance-spark/high-performance-spark-examples
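One recurring tip is to keep work in the DataFrame API instead of round-tripping rows through Python lambdas, since column expressions run in the JVM and go through the Catalyst optimizer. A minimal sketch, assuming Spark 2.x and a made-up events.csv file with user and bytes columns:
# Minimal PySpark sketch: prefer built-in column expressions over
# Python lambdas. The input file and column names are illustrative.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("df-tips").getOrCreate()
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Slow pattern: df.rdd.map(lambda r: r.bytes * 2) serializes every row
# out to Python workers and back.
# Faster: the same logic as a column expression stays in the JVM.
fast = (df.withColumn("doubled", F.col("bytes") * 2)
          .groupBy("user")
          .agg(F.sum("doubled").alias("total_bytes")))
fast.show()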
Saturday, March 26, 2016
Performance and LLAP in Hive
Hive 2.0 introduces LLAP (Live Long and Process) functionality. LLAP is a part of the Stinger.next initiative to address sub-second response times for interactive analytic queries.
The proposal for this feature is here.
https://issues.apache.org/jira/secure/attachment/12665704/LLAPdesigndocument.pdf
Interactive query response times are important when business intelligence tools query Hive directly.
When you execute a query in database engines like SQL Server or Oracle, the first time it can be expensive to run. Once the cache is warmed up, speed can increase dramatically. This problem rears its head frequently with poor or non-reusable query execution plans that require the engine to go to disk and scan tables for every query rather than efficiently reusing plans and data caches. System configurations, indexing strategies and statistics all contribute to the performance puzzle.
When you run a distributed Hive query using the Tez engine, it may spin up containers in YARN to process data in the cluster. This startup is relatively expensive, and even with Tez container re-use enabled, it doesn't cache result fragments or query access patterns across sessions the way SQL Server and other relational database engines do.
There are many actions happening in the background, and it really doesn't make sense to repeat most of them for every interactive query. JIT optimization isn't really effective unless the Java process sticks around for a while.
LLAP introduces optional daemons (long-running processes) on worker nodes to facilitate improvements to I/O, caching, and query fragment execution. To reduce the complexity of installing the daemons on nodes, Slider can be used to distribute LLAP in the cluster as a long-running YARN application.
LLAP offers parallel execution of query fragments from different queries and sessions.
Metadata is cached in memory on-heap, data is cached in column chunks and persisted off-heap, and YARN is responsible for managing and allocating resources.
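As a hedged sketch of what enabling this looks like from a client session (property names taken from the Hive 2.0/HDP documentation of the time; the @llap0 application name is cluster-specific and just an example):
-- Route query fragments to the LLAP daemons for this session.
SET hive.execution.mode=llap;
SET hive.llap.execution.mode=all;
-- Registered name of the Slider/YARN application running the daemons.
SET hive.llap.daemon.service.hosts=@llap0;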
More information
Stinger Next
Hadoop Summit 2015
http://www.slideshare.net/Hadoop_Summit/llap-longlived-execution-in-hive
Bay Area Hive Contributor Meetup Presentation.
Build LLAP and launch in a Slider container on HDP 2.3
https://gist.github.com/abajwa-hw/64bd19e3c93de97b73c6
https://www.snip2code.com/Snippet/832252/Build-LLAP-on-HDP-2-3
Sunday, March 6, 2016
Connection refused when starting MySQL
This appears to be a common issue with MySQL not accepting remote connections; it cropped up for me a couple of times when installing Hortonworks HDP 2.4 and trying to use an existing MySQL instance for the Ambari database, Hive Metastore, Oozie and other Hadoop services.
Some steps taken to address the issue.
Confirm root access to mysql
https://www.digitalocean.com/community/questions/restoring-root-access-privileges-to-mysql
Check for running mysql processes and kill any that are running.
ps -A | grep mysql
Grant Remote Access
Change /etc/my.cnf adding a bind-address and port.
#/etc/my.cnf
bind-address=0.0.0.0 # this can be a static address if available.
port=3306
Restart service, in my case MariaDB on Centos7.
systemctl start mariadb
Check the log for errors.
cat /var/log/mariadb/mariadb.log
160306 12:04:52 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.44-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 3306 MariaDB Server
Create the Oozie and Hive Users & Databases.
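A minimal sketch of that step; the database names follow the service names, but the passwords and the broad '%' host grant are illustrative, so substitute your own and restrict hosts where you can.
CREATE DATABASE hive;
CREATE DATABASE oozie;
-- Illustrative credentials; '%' allows connections from any host.
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%' IDENTIFIED BY 'hive_password';
GRANT ALL PRIVILEGES ON oozie.* TO 'oozie'@'%' IDENTIFIED BY 'oozie_password';
FLUSH PRIVILEGES;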
Spin up the Hive Metastore. Ambari will do this with a service restart or can test manually.
export HIVE_CONF_DIR=/usr/hdp/current/hive-metastore/conf/conf.server ; /usr/hdp/current/hive-metastore/bin/schematool -initSchema -dbType mysql -userName hive -passWord <enter_hive_password_here> -verbose
Saturday, February 20, 2016
Hive 2.0 includes HPL/SQL
HPL/SQL (formerly PL/HQL) is a language translation and execution layer developed by Dmitry Tolpeko. It was introduced into the Hive source code in June 2015 (HIVE-11055) and included in the Hive 2.0 release in February 2016. However, it doesn't need Hive to function. From the original announcement:
Let me introduce PL/HQL, an open source tool that implements procedural SQL and can be used with any SQL-on-Hadoop solution.
Motivation:
- Writing the driver code using well-known procedural SQL (not bash), which opens Hadoop to an even wider audience
- Allowing dynamic SQL, iterations, flow-of-control and SQL exception handling
- Facilitating migration of RDBMS workloads to Hadoop
Plans (besides extending syntax):
- Supporting CREATE PROCEDURE/FUNCTION/PACKAGE to reuse code
- Allowing connections to multiple databases (e.g. lookup tables in relational databases)
- On-the-fly SQL conversion (e.g. of SELECT statements), compatibility layer
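To give a feel for the procedural layer, here is a small hedged sketch (constructs per the HPL/SQL reference of the time; it needs nothing but the hplsql binary to run):
-- Purely local HPL/SQL: variables, a loop and string concatenation.
DECLARE total INT DEFAULT 0;
FOR i IN 1..5 LOOP
  total := total + i;
END LOOP;
PRINT 'Total: ' || total;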
Current steps to install in a Hortonworks HDP 2.3.2 environment. Substitute the version you are using.
Download and Install
tar xvf hplsql-0.3.13.tar.gz -C /usr/hdp/2.3.2.0-2950/
ln -s /usr/hdp/2.3.2.0-2950/hplsql-0.3.13/ /usr/hdp/current/hplsql
Configure HADOOP_CLASSPATH
Edit /usr/hdp/current/hplsql/hplsql
Replace /usr/lib/ with /usr/hdp/2.3.2.0-2950/
Add to PATH (in this case globally)
echo "PATH=${PATH}:/usr/hdp/current/hplsql" > /etc/profile.d/hplsql-path.sh && chmod 755 /etc/profile.d/hplsql-path.sh
Configure hplsql-site.xml
This configures Hive connection settings and connectivity to other databases (MySQL, Teradata, IBM DB2, Oracle, MSSQL).
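A hedged sketch of a connection profile in hplsql-site.xml; the semicolon-separated driver;url;user;password value format follows the HPL/SQL documentation, and the mysqlconn name, database and credentials below are illustrative:
<property>
  <name>hplsql.conn.mysqlconn</name>
  <!-- Format: JDBC driver class;connection URL;user;password -->
  <value>com.mysql.jdbc.Driver;jdbc:mysql://localhost:3306/log;hpluser;hplpass</value>
</property>
The connection name is what the AT clause refers to in the examples below.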
Test from Command Line
hplsql --version
Get the previous date:
START=$(hplsql -e 'CURRENT_DATE - 1')
Copy table to a file:
COPY (SELECT id, name FROM sales.users WHERE local_dt = CURRENT_DATE) TO /data/users.txt DELIMITER '\t';
Copy table from default connection (Hive) to Teradata connection
COPY sales.users TO sales.users2 AT tdconn;
Log to MySQL from Hive
MAP OBJECT log TO log.log_data AT mysqlconn;
DECLARE cnt INT;
SELECT count(*) INTO cnt FROM sales.users WHERE local_dt = CURRENT_DATE;
INSERT INTO log (message) VALUES ('Number of users: ' || cnt);
Compare Hive table totals to MySQL
CMP SUM sales.users WHERE local_dt = CURRENT_DATE, users_daily AT mysqlconn;
A great addition to the Hive codebase.
Thursday, January 14, 2016
Working with Jupyter Notebooks
The IPython Notebook and its offshoots (Jupyter, Zeppelin, Spark Notebook, etc.) are very useful for learning, data science, collaboration, data visualization, and instant feedback through a REPL (Read-Eval-Print Loop) interface. A REPL lets you run code line by line and, in the case of Spark and other Hadoop tools, run it against a cluster of machines.
A good history of the IPython notebook from Fernando Perez, creator of IPython:
"We coded frantically in parallel: one of us wrote the kernel and the other the client, and we'd debug one of them while leaving the other running in the meantime. It was the perfect blend of pair programming and simultaneous development, and in just two days we had a prototype of a python shell over zmq working."
As of this writing, Jupyter, the latest incarnation of the IPython notebook, has over 50 kernels to parse and run code within a notebook interface.
Further to the last blog post, search GitHub for Jupyter Notebooks to see more examples.
filename:ipynb
https://github.com/search?l=jupyter-notebook&q=filename%3Aipynb&type=Code&utf8=%E2%9C%93
Here are some interesting examples on GitHub:
Parsing Apache Logs with Spark
Interactive C# Notebook
Predicting Airline Delays with Pig and Python
Binder is just one host of notebooks; here is an example using CERN's ROOT framework to run C++ in a browser.
http://app.mybinder.org/2191543109/notebooks/index.ipynb
http://app.mybinder.org/2191543109/notebooks/notebooks/ROOT_Example.ipynb
Sunday, January 3, 2016
Configs and GitHub Viz
In the case of open-source projects, you may need to dig further into what a particular configuration setting does. If the documentation does not give you enough detail on the implementation, you can trace the configuration details by searching for the file or getting the project from GitHub or SVN.
Configuration code for some of the common Hadoop projects:
Pig configuration
Hive configuration
Sqoop configuration
Flume configuration
Kafka configuration
GitHub allows you to scope your searches, which is useful for narrowing down to specific files.
Searching code is documented here. In the search box, you can search by filename:<myfile> or <myfile> in:path to track down particular files. You can also search by language, as the examples below show.
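A few illustrative queries using those qualifiers (typed into the GitHub search box with the Code filter; the filenames here are just examples):
filename:hive-site.xml
log4j.properties in:path
language:scala spark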
At the time of this writing, there are some interesting stats available just by looking at the language of repositories on GitHub:
- There are 1.5m Java repos with ElasticSearch, Android Universal Image Loader and Reactive Extensions for the JVM showing up as the top 3 best matches.
- There are nearly 900k Python repos with httpie, Flask, the Django framework and the Awesome Python library coming in as the top 4 best matches.
- There are 400k C# repos with the .NET framework, SignalR and Mono in the top 3.
- There are 421k C repos with Linux being the best match.
- There are 60k Scala repos with PredictionIO, the Play Framework and Scala itself showing up as the top 3 best matches.
Much cooler than just searching is the GitHub Visualizer created by Artem Zukov using D3.js.
Apache's visualization shows an assortment of languages in their repos.
The Hive repo's contributors and file extensions.