Wednesday, November 2, 2016
What's trending in the world of GitHub and Open Source?
GitHub has a trove of information about its organizations, committers, repos, code and issues.
GitHub Archive maintains per-hour stats on 28 event types hooking into repo activities across the platform. It includes things like a committer's login, URL, organization, followers, gists, starred repos, and a history of all their public coding activity.
The October 2016 archive table is 29M rows and 70GB of data.
In September 2016, Microsoft became the largest "open-source" contributor organization on GitHub, largely due to its custom API integration using Azure Services and a rather elegant management system for its employees and repos. If you can onboard all developers in a company the size of Microsoft, and automate repository setup and discovery, you will quickly become the largest contributor.
Microsoft beat out Docker, Angular, Google, Atom, FortAwesome, Elastic, and even Apache.
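GitHub Archive publishes these events as public BigQuery tables, which is how numbers like the above can be derived. As a rough sketch (the githubarchive.month table naming and the org.login field are assumptions based on the public dataset's documented schema, so verify them before relying on this), a top-organizations query could look like:
-- Count October 2016 events per organization (BigQuery standard SQL).
-- Table name and schema assumed from the public githubarchive dataset.
SELECT org.login AS organization, COUNT(*) AS events
FROM `githubarchive.month.201610`
WHERE org.login IS NOT NULL
GROUP BY organization
ORDER BY events DESC
LIMIT 20;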
Wednesday, June 22, 2016
Getting the best performance with Pyspark
Some good tips on performance with PySpark and DataFrames from Holden Karau at Spark Summit 2016.
With code
https://github.com/high-performance-spark/high-performance-spark-examples
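One recurring tip is to keep work in the DataFrame API instead of round-tripping rows through Python lambdas, since column expressions run in the JVM and go through the Catalyst optimizer. A minimal sketch, assuming Spark 2.x and a made-up events.csv file with user and bytes columns:
# Minimal PySpark sketch: prefer built-in column expressions over
# Python lambdas. The input file and column names are illustrative.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("df-tips").getOrCreate()
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Slow pattern: df.rdd.map(lambda r: r.bytes * 2) serializes every row
# out to Python workers and back.
# Faster: the same logic as a column expression stays in the JVM.
fast = (df.withColumn("doubled", F.col("bytes") * 2)
          .groupBy("user")
          .agg(F.sum("doubled").alias("total_bytes")))
fast.show()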
Saturday, March 26, 2016
Performance and LLAP in Hive
Hive 2.0 introduces LLAP (Live Long and Process) functionality. LLAP is a part of the Stinger.next initiative to address sub-second response times for interactive analytic queries.
The proposal for this feature is here.
https://issues.apache.org/jira/secure/attachment/12665704/LLAPdesigndocument.pdf
Interactive query response times are important when business intelligence tools query Hive directly.
When you execute a query in database engines like SQL Server or Oracle, the first time it can be expensive to run. Once the cache is warmed up, speed can increase dramatically. This problem rears its head frequently with poor or non-reusable query execution plans that require the engine to go to disk and scan tables for every query rather than efficiently reusing plans and data caches. System configurations, indexing strategies and statistics all contribute to the performance puzzle.
When you run a distributed Hive query using the Tez engine, it may spin up containers in YARN to process data in the cluster. This startup is relatively expensive, and even with Tez container re-use enabled, it doesn't cache result fragments or query access patterns across sessions the way SQL Server and other relational database engines do.
There are many actions happening in the background, and it really doesn't make sense to repeat most of them for every interactive query. JIT optimization isn't really effective unless the Java process sticks around for a while.
LLAP introduces optional daemons (long-running processes) on worker nodes to facilitate improvements to I/O, caching, and query fragment execution. To reduce the complexity of installing the daemons on nodes, Slider can be used to distribute LLAP in the cluster as a long-running YARN application.
LLAP offers parallel execution of query fragments from different queries and sessions.
Metadata is cached in memory on-heap, data is cached in column chunks and persisted off-heap, and YARN is responsible for managing and allocating resources.
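As a hedged sketch of what enabling this looks like from a client session (property names taken from the Hive 2.0/HDP documentation of the time; the @llap0 application name is cluster-specific and just an example):
-- Route query fragments to the LLAP daemons for this session.
SET hive.execution.mode=llap;
SET hive.llap.execution.mode=all;
-- Registered name of the Slider/YARN application running the daemons.
SET hive.llap.daemon.service.hosts=@llap0;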
More information
Stinger Next
Hadoop Summit 2015
http://www.slideshare.net/Hadoop_Summit/llap-longlived-execution-in-hive
Bay Area Hive Contributor Meetup Presentation.
Build LLAP and launch in a Slider container on HDP 2.3
https://gist.github.com/abajwa-hw/64bd19e3c93de97b73c6
https://www.snip2code.com/Snippet/832252/Build-LLAP-on-HDP-2-3
Sunday, March 6, 2016
Connection refused when starting MySQL
This appears to be a common issue with MySQL not accepting remote connections; it cropped up for me a couple of times when installing Hortonworks HDP 2.4 and trying to use an existing MySQL instance for the Ambari database, Hive Metastore, Oozie and other Hadoop services.
Some steps taken to address the issue.
Confirm root access to mysql
https://www.digitalocean.com/community/questions/restoring-root-access-privileges-to-mysql
Check for running mysql processes and kill any that are running.
ps -A | grep mysql
Grant Remote Access
Change /etc/my.cnf adding a bind-address and port.
#/etc/my.cnf
bind-address=0.0.0.0 # this can be a static address if available.
port=3306
Restart service, in my case MariaDB on Centos7.
systemctl start mariadb
Check the log for errors.
cat /var/log/mariadb/mariadb.log
160306 12:04:52 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.44-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 3306 MariaDB Server
Create the Oozie and Hive Users & Databases.
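A minimal sketch of that step; the database names follow the service names, but the passwords and the broad '%' host grant are illustrative, so substitute your own and restrict hosts where you can.
CREATE DATABASE hive;
CREATE DATABASE oozie;
-- Illustrative credentials; '%' allows connections from any host.
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%' IDENTIFIED BY 'hive_password';
GRANT ALL PRIVILEGES ON oozie.* TO 'oozie'@'%' IDENTIFIED BY 'oozie_password';
FLUSH PRIVILEGES;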
Spin up the Hive Metastore. Ambari will do this with a service restart or can test manually.
export HIVE_CONF_DIR=/usr/hdp/current/hive-metastore/conf/conf.server ; /usr/hdp/current/hive-metastore/bin/schematool -initSchema -dbType mysql -userName hive -passWord <enter_hive_password_here> -verbose
Saturday, February 20, 2016
Hive 2.0 includes HPL/SQL
HPL/SQL (formerly PL/HQL) is a language translation and execution layer developed by Dmitry Tolpeko. It was introduced into the Hive source code in June 2015 (HIVE-11055) and included in the Hive 2.0 release in February 2016. However, it doesn't need Hive to function. From the original announcement:
Let me introduce PL/HQL, an open source tool that implements procedural SQL and can be used with any SQL-on-Hadoop solution.
Motivation:
- Writing the driver code using well-known procedural SQL (not bash), which opens Hadoop to an even wider audience
- Allowing dynamic SQL, iterations, flow-of-control and SQL exception handling
- Facilitating migration of RDBMS workloads to Hadoop
Plans (besides extending syntax):
- Supporting CREATE PROCEDURE/FUNCTION/PACKAGE to reuse code
- Allowing connections to multiple databases (e.g. lookup tables in relational databases)
- On-the-fly SQL conversion (e.g. of SELECT statements), compatibility layer
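To give a feel for the procedural layer, here is a small hedged sketch (constructs per the HPL/SQL reference of the time; it needs nothing but the hplsql binary to run):
-- Purely local HPL/SQL: variables, a loop and string concatenation.
DECLARE total INT DEFAULT 0;
FOR i IN 1..5 LOOP
  total := total + i;
END LOOP;
PRINT 'Total: ' || total;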
Current steps to install in a Hortonworks HDP 2.3.2 environment. Substitute the version you are using.
Download and Install
tar xvf hplsql-0.3.13.tar.gz -C /usr/hdp/2.3.2.0-2950/
ln -s /usr/hdp/2.3.2.0-2950/hplsql-0.3.13/ /usr/hdp/current/hplsql
Configure HADOOP_CLASSPATH
Edit /usr/hdp/current/hplsql/hplsql
Replace /usr/lib/ with /usr/hdp/2.3.2.0-2950/
Add to PATH (in this case globally)
echo "PATH=${PATH}:/usr/hdp/current/hplsql" > /etc/profile.d/hplsql-path.sh && chmod 755 /etc/profile.d/hplsql-path.sh
Configure hplsql-site.xml
This configures Hive connection settings and connectivity to other databases (MySQL, Teradata, IBM DB2, Oracle, MSSQL).
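A hedged sketch of a connection profile in hplsql-site.xml; the semicolon-separated driver;url;user;password value format follows the HPL/SQL documentation, and the mysqlconn name, database and credentials below are illustrative:
<property>
  <name>hplsql.conn.mysqlconn</name>
  <!-- Format: JDBC driver class;connection URL;user;password -->
  <value>com.mysql.jdbc.Driver;jdbc:mysql://localhost:3306/log;hpluser;hplpass</value>
</property>
The connection name is what the AT clause refers to in the examples below.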
Test from Command Line
hplsql --version
Get the previous date:
START=$(hplsql -e 'CURRENT_DATE - 1')
Copy table to a file:
COPY (SELECT id, name FROM sales.users WHERE local_dt = CURRENT_DATE) TO /data/users.txt DELIMITER '\t';
Copy table from default connection (Hive) to Teradata connection
COPY sales.users TO sales.users2 AT tdconn;
Log to MySQL from Hive
MAP OBJECT log TO log.log_data AT mysqlconn;
DECLARE cnt INT;
SELECT count(*) INTO cnt FROM sales.users WHERE local_dt = CURRENT_DATE;
INSERT INTO log (message) VALUES ('Number of users: ' || cnt);
Compare Hive table totals to MySQL
CMP SUM sales.users WHERE local_dt = CURRENT_DATE, users_daily AT mysqlconn;
A great addition to the Hive codebase.
Thursday, January 14, 2016
Working with Jupyter Notebooks
The IPython Notebook and its offshoots (Jupyter, Zeppelin, Spark Notebook, etc.) are very useful for learning, data science, collaboration, data visualization, and instant feedback through a REPL (Read-Eval-Print Loop) interface. A REPL lets you run code line by line and, in the case of Spark and other Hadoop tools, run it against a cluster of machines.
A good history of the IPython notebook from Fernando Perez, creator of IPython:
"We coded frantically in parallel: one of us wrote the kernel and the other the client, and we'd debug one of them while leaving the other running in the meantime. It was the perfect blend of pair programming and simultaneous development, and in just two days we had a prototype of a python shell over zmq working."
As of this writing, Jupyter, the latest incarnation of the IPython notebook, has over 50 kernels to parse and run code within a notebook interface.
Further to the last blog post, search GitHub for Jupyter Notebooks to see more examples.
filename:ipynb
https://github.com/search?l=jupyter-notebook&q=filename%3Aipynb&type=Code&utf8=%E2%9C%93
Here are some interesting examples on GitHub:
Parsing Apache Logs with Spark
Interactive C# Notebook
Predicting Airline Delays with Pig and Python
Binder is just one host of notebooks; here is an example using CERN's ROOT framework to run C++ in a browser.
http://app.mybinder.org/2191543109/notebooks/index.ipynb
http://app.mybinder.org/2191543109/notebooks/notebooks/ROOT_Example.ipynb
Sunday, January 3, 2016
Configs and GitHub Viz
In the case of open-source projects, you may need to dig further into what a particular configuration setting does. If the documentation does not give you enough detail on the implementation, you can trace the configuration details by searching for the file or getting the project from GitHub or SVN.
Configuration code for some of the common Hadoop projects:
Pig configuration
Hive configuration
Sqoop configuration
Flume configuration
Kafka configuration
GitHub allows you to scope your searches, which is useful for narrowing down to specific files.
Searching code is documented here. In the search box, you can search by filename:<myfile> or <myfile> in:path to track down particular files. You can also search by language, as the examples below show.
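A few illustrative queries using those qualifiers (typed into the GitHub search box with the Code filter; the filenames here are just examples):
filename:hive-site.xml
log4j.properties in:path
language:scala spark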
At the time of this writing, there are some interesting stats available just by looking at the language of repositories on GitHub:
- There are 1.5m Java repos with ElasticSearch, Android Universal Image Loader and Reactive Extensions for the JVM showing up as the top 3 best matches.
- There are nearly 900k Python repos with httpie, Flask, the Django framework and the Awesome Python library coming in as the top 4 best matches.
- There are 400k C# repos with the .NET framework, SignalR and Mono in the top 3.
- There are 421k C repos with Linux being the best match.
- There are 60k Scala repos with PredictionIO, the Play Framework and Scala itself showing up as the top 3 best matches.
Much cooler than just searching is the GitHub Visualizer created by Artem Zukov using D3.js.
Apache's visualization shows an assortment of languages in their repos.
The Hive repo's contributors and file extensions.