Thursday, January 14, 2016

Working with Jupyter Notebooks

The IPython Notebook and related notebook tools such as Jupyter, Zeppelin, and Spark are very useful for learning, data science, collaboration, data visualization, and getting instant feedback through a REPL (Read-Eval-Print Loop) interface.  A REPL lets you run code one line or block at a time and see the result immediately, and in the case of Spark and other Hadoop tools, run that code against a cluster of machines.
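
To make the read-eval-print idea concrete, here is a minimal sketch in Python. It is not how IPython or Jupyter are implemented; `tiny_repl` is a made-up helper that takes its input from a list instead of a keyboard, so that the loop is easy to follow.

```python
def tiny_repl(lines, namespace=None):
    """Evaluate each input line in turn and collect what a REPL would print."""
    namespace = {} if namespace is None else namespace
    printed = []
    for line in lines:                      # Read: take the next line of input
        try:
            result = eval(line, namespace)  # Eval: works for expressions
        except SyntaxError:
            exec(line, namespace)           # statements (e.g. assignments) need exec
            continue
        if result is not None:
            printed.append(repr(result))    # Print: echo the value of the expression
    return printed                          # Loop: ends when the input runs out


outputs = tiny_repl(["x = 6", "x * 7", "'spam'.upper()"])
for text in outputs:
    print(text)
```

State carries over from one line to the next through the shared namespace, which is the same reason a variable defined in one notebook cell is visible in later cells.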

Fernando Perez, creator of IPython, wrote a good history of the IPython notebook:
"We coded frantically in parallel: one of us wrote the kernel and the other the client, and we'd debug one of them while leaving the other running in the meantime.  It was the perfect blend of pair programming and simultaneous development, and in just two days we had a prototype of a python shell over zmq working."


As of this writing, Jupyter, the latest incarnation of IPython's notebook, has over 50 kernels (interpreters) for running code within a notebook interface.

Further to the last blog post, you can search GitHub for Jupyter notebooks to see more examples:
filename:ipynb

https://github.com/search?l=jupyter-notebook&q=filename%3Aipynb&type=Code&utf8=%E2%9C%93
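
The search URL above is easier to read once its query string is decoded. A small sketch using only Python's standard library (the variable names are just for illustration):

```python
from urllib.parse import urlsplit, parse_qs

# The search URL from this post, split over two lines for readability.
SEARCH_URL = ("https://github.com/search?l=jupyter-notebook"
              "&q=filename%3Aipynb&type=Code&utf8=%E2%9C%93")

# parse_qs percent-decodes each value and returns a dict of lists.
params = parse_qs(urlsplit(SEARCH_URL).query)
for name, values in sorted(params.items()):
    print(name, "=", values[0])
```

The `q` parameter decodes to the filename:ipynb qualifier, and `l` carries the language filter.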

Here are some interesting examples on GitHub:

Parsing Apache Logs with Spark
Interactive C# Notebook
Predicting Airline Delays with Pig and Python

Binder is just one host for notebooks. Here is an example that uses CERN's ROOT framework to run C++ in a browser:
http://app.mybinder.org/2191543109/notebooks/index.ipynb
http://app.mybinder.org/2191543109/notebooks/notebooks/ROOT_Example.ipynb

Sunday, January 3, 2016

Configs and GitHub Viz

In the case of open-source projects, you may need to dig further into what a particular configuration setting does.  If the documentation does not give you enough detail on the implementation, you can trace the setting yourself by searching for the relevant file or by getting the project source from GitHub or SVN.
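
Once you have the source on disk, tracing a setting can be as simple as scanning the checkout for the configuration key. Below is a rough sketch, assuming a local copy of the project. `find_setting`, the key `example.setting.name` and the demo files are all made up for illustration, and the list of file suffixes is a guess you would adjust per project.

```python
import tempfile
from pathlib import Path


def find_setting(root, key, suffixes=(".java", ".xml", ".properties")):
    """Return (relative path, line number, line) for every line mentioning key."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        # Only look at regular files with one of the chosen suffixes.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if key in line:
                hits.append((path.relative_to(root).as_posix(), number, line.strip()))
    return hits


# Demo against a throwaway directory standing in for a real checkout.
with tempfile.TemporaryDirectory() as checkout:
    conf = Path(checkout) / "conf"
    conf.mkdir()
    (conf / "Example.java").write_text(
        'class Example {\n  String KEY = "example.setting.name";\n}\n',
        encoding="utf-8")
    (conf / "notes.txt").write_text("example.setting.name\n", encoding="utf-8")
    hits = find_setting(checkout, "example.setting.name")

for hit in hits:
    print(hit)
```

The file and line number tell you where the key is defined; from there you can follow its usages to see what the setting actually changes.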

Here is the configuration code for some of the common Hadoop projects:

Pig configuration
Hive configuration
Sqoop configuration
Flume configuration
Kafka configuration

GitHub allows you to scope your searches, which is useful for narrowing them down to specific files.

Searching code is documented in GitHub's help pages.  In the search box, you can search by filename:<myfile> or <myfile> in:path to track down particular files.  You can also search by language; for example, language:scala searches for Scala files.
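
The qualifiers can also be combined into a search URL by hand. This is a sketch only: `code_search_url` is a hypothetical helper, `hive-site.xml` is just an example file name, and the URL layout reuses the `q` and `type=Code` parameters that GitHub's search page puts in the address bar.

```python
from urllib.parse import urlencode


def code_search_url(*terms):
    """Join search terms and qualifiers into a GitHub code search URL."""
    query = " ".join(terms)
    # urlencode percent-encodes ':' and turns spaces into '+'.
    return "https://github.com/search?" + urlencode({"q": query, "type": "Code"})


by_filename = code_search_url("filename:hive-site.xml")
by_path = code_search_url("hive-site.xml", "in:path")
by_language = code_search_url("language:scala")

for url in (by_filename, by_path, by_language):
    print(url)
```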

At the time of this writing, there are some interesting stats available just by looking at the language of repositories on GitHub:

  • There are 1.5 million Java repos with Elasticsearch, Android Universal Image Loader and Reactive Extensions for the JVM showing up as the top 3 best matches.
  • There are nearly 900k Python repos with httpie, Flask, the Django framework and the Awesome Python library coming in the top 4 best matches.
  • There are 400k C# repos with the .NET framework, SignalR and Mono in the top 3.
  • There are 421k C repos with Linux being the best match.
  • There are 60k Scala repos with PredictionIO, the Play Framework and Scala itself showing up as top 3 best matches.

Much cooler than just searching is the GitHub Visualizer, created by Artem Zubkov using D3.js.

Apache's visualization shows an assortment of languages in their repos.

The Hive repo's contributors and file extensions