The Bay Area Hadoop User Group meetup has over 5.2k members. Meetups like these are hugely popular and provide great resources, slides, etc.
Searching Google brings back 820+ results for PDFs related to Hadoop from Meetup files. Lots of great information here.
Search Google for inurl:files.meetup.com Hadoop to find Hadoop resources, or swap in any other meetup topic you find interesting. Searching for just inurl:files.meetup.com returns over 76,000 resources.
Filling up the Kindle...
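The search above is easy to script if you want to try several topics. This is only a rough sketch: the inurl: operator is the one from the post, while the filetype: filter and the meetup_files_query helper name are my own assumptions for narrowing results to PDFs.

```python
from urllib.parse import urlencode


def meetup_files_query(topic, filetype="pdf"):
    """Build a Google search URL restricted to files.meetup.com."""
    # inurl: comes from the post; filetype: is an assumed extra filter.
    terms = ["inurl:files.meetup.com", topic]
    if filetype:
        terms.append(f"filetype:{filetype}")
    return "https://www.google.com/search?" + urlencode({"q": " ".join(terms)})


# Print a URL that can be pasted into a browser.
print(meetup_files_query("Hadoop"))
```

Passing filetype=None drops the extra filter and reproduces the plain query from the post.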
Friday, December 11, 2015
Friday, November 6, 2015
Closer look at U-SQL, Microsoft's HiveQL
Microsoft U-SQL is the query language used on the Azure Data Lake Analytics service. Based on SCOPE and Cosmos, which have been around since at least 2008, it combines C# types and expressions, schema-on-read, and custom processors and reducers into a SQL-like ETL and output language.
Keywords need to be upper case. The WHERE clause uses C#-style == syntax. Each row can contain up to 4MB of data.
U-SQL supports SQL.MAP<K,V> and SQL.ARRAY<T>.
U-SQL supports inline C# expressions, UDFs, UDAs for custom aggregation, and UDOs to generate, process, and consume rowsets.
UDOs are user-defined operators built with Visual Studio.
https://azure.microsoft.com/pt-pt/documentation/articles/data-lake-analytics-u-sql-develop-user-defined-operators/
It will be interesting to see if this language makes it into SQL Server itself. Extractors and Outputters would be highly useful to replace some of the functionality of SSIS.
I built a similar tool a few years ago for schema-on-read. It brought CSV files into BLOB columns in SQL Server (read my article on BLOBs on SQL Server Central) and allowed you to query them by converting to nvarchar(max), applying a schema, and then outputting to a table.
Kind of felt like a data lake at the time.... though it wasn't massively parallel and didn't have any kind of map-reduce job spinning up. Then MS introduced the filestream object...
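To make the schema-on-read idea concrete, here is a minimal sketch in Python rather than U-SQL or T-SQL. It is not the tool described above: the sample rows, the column names, and the read_with_schema helper are all made up for illustration. The point is only that the raw text is stored untouched and the schema is applied when you read it.

```python
import csv
import io

# Raw CSV kept as an opaque text blob, similar in spirit to storing the file
# in a BLOB column. The sample rows are invented for illustration.
RAW_BLOB = "1,widget,9.99\n2,gadget,24.50\n"

# The schema lives outside the data and is applied only at read time.
SCHEMA = [("id", int), ("name", str), ("price", float)]


def read_with_schema(blob, schema):
    """Parse the blob and yield one typed dict per row."""
    for row in csv.reader(io.StringIO(blob)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


# "Query" the typed rows, then output the result.
rows = list(read_with_schema(RAW_BLOB, SCHEMA))
expensive = [r["name"] for r in rows if r["price"] > 10]
print(expensive)
```

Changing SCHEMA changes how the same stored bytes are interpreted, which is the appeal of deferring the schema until query time.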
Tuesday, October 27, 2015
VirtualBox error - Kernel driver after CentOS update
On my CentOS 7 box, I lost the kernel sources after an update. VirtualBox would no longer start a VM because the update required its kernel driver to be recompiled.
Running /usr/sbin/rcvboxdrv setup
showed some errors in cat /var/log/vbox-install.log
After removing and reinstalling the kernel sources and running the above command again, VirtualBox recompiled its kernel module.
yum remove kernel-devel gcc
yum install kernel-devel gcc
Unfortunately, this may also remove some dependencies, so back up your environment first!
I then had to reboot to avoid the "Creating a process..." message for VirtualBox.
Friday, October 23, 2015
ZSH and Oh-My-Zsh Shell Plugins
I remember a long, long time ago, in a galaxy far, far away, I played around with setting up custom DOS prompts. Memories of Ansi.sys and custom ANSI art come streaming into my brain...
Forget all that. On CentOS, these two commands will install the Z Shell and Oh-My-Zsh:
yum install zsh
sh -c "$(wget https://raw.github.com/robbyrussell/oh-my-zsh/master/tools/install.sh -O -)"
Tab allows you to visualize potential paths, running processes, ls without hitting enter, and other awesomeness.
There are one or two Themes and Plugins available.
Some laundry lists, tricks and cheat sheets.
If you're running Windows, 720MB of Babun will get you Zsh among other things...
Sunday, October 4, 2015
Hue on HDInsight and HDInsight on Linux
Microsoft might have just made Data Lakes a commodity offering.
Convergence with the Linux realm is happening again at Microsoft with the introduction of Hue on HDInsight (a graphical interface for Hadoop/HDP) and HDInsight on Linux. Hue has been around for quite a while in the Apache realm and in most Hadoop distros; glad to see HDInsight is finally getting a user-friendly GUI.
Another announcement introduces U-SQL (see Michael Rys (@MikeDoesBigData), Introducing U-SQL): a SQL-like, Hive/Pig/Grep/Awk combo language for ELT+QE (Extract/Load/Transform + Query/Extract) on top of the HDInsight Big Data Lake.
The biggest announcement is the Azure Data Lake itself...
Shared folders with Virtualbox and Centos 7
Got a build error with the VirtualBox Guest Additions and HDP Sandbox 2.2.
Building the main Guest Additions module [FAILED]
Checking the log file for errors showed a missing kernel directory, which the following fixed.
$ export KERN_DIR=/lib/modules/2.6.32-504.1.3.el6.x86_64/
$ cd /usr/src/kernels
$ ln -s /usr/src/kernels/2.6.32-573.7.1.el6.x86_64/ 2.6.32-504.1.3.el6.x86_64
Thursday, September 24, 2015
Multi tab Putty
Just as it sounds, multiple tabbed putty.
http://ttyplus.com/
And for Windows, there's Clover
http://ejie.me/
Sunday, September 20, 2015
Scala kernel for Jupyter notebook and randomness
Some links for Jupyter Notebook setup and randomness...
The Jupyter Scala kernel from Alexandre Archambault
https://github.com/alexarchambault/jupyter-scala
The Jupyter Spark kernel from Brian Schlining
https://github.com/hohonuuli/sparknotebook
Something random from Brian - convert genome information to MIDI music.
https://github.com/hohonuuli/dna-music/blob/master/README
Something else, video to ASCII with Akka Streams?
https://github.com/hohonuuli/streamerz
Jupyter Server
https://github.com/jupyter/jupyterhub
Setting up public server
http://jupyter-notebook.readthedocs.org/en/latest/public_server.html
Java 9 Kernel for some REPL prototyping
https://github.com/Bachmann1234/java9_kernel
http://blog.takipi.com/5-features-in-java-9-that-will-change-how-you-develop-software-and-2-that-wont/
Adding R to Jupyter
http://ihrke.github.io/jupyter.html
Saturday, September 12, 2015
Jupyter, iPython and AVG Antivirus
Linux tools like cygwin and python don't play well with security and Windows.
Not sure why I bother trying to work with Linux apps on a Wintel box anyway, but if anyone else is...
Exclude the c:\python34 directory in your virus scanner (http://useragent.xyz/lost-and-important-file-for-python-34/) to get to Jupyter (https://try.jupyter.org/) by running
pip install jupyter
for ICSharp (https://github.com/zabirauf/icsharp) by running
choco install icsharp
which installs Python 3.4.3... sigh, throws some 404 error, and installs 2 / 4 packages, not including icsharp.
Looks like http://python-distribute.org/distribute_setup.py is for sale. :)
git clone https://github.com/zabirauf/icsharp
If you're not familiar with IPython, now called Jupyter to be language agnostic, it is pretty awesome, and distributed peer programming will never be the same.
http://www.nature.com/news/ipython-interactive-demo-7.21492
https://www.authorea.com/
https://cloud.sagemath.com/
https://wakari.io/
Now if I could only get the shift-enter compile execution shortcut working with OneNote and C#, I would be so happy.
http://tryroslyn.azurewebsites.net/
Monday, August 17, 2015
Engine Noise and the Internet of Things
According to a recent blog post by Stephen Few, Data Visualization Guru, "The exponential growth in raw data that we’re experiencing is mostly producing noise."
I used to be a car audiophile of sorts. It was mainly about the highest tweets and lowest subs. Surprisingly my hearing didn't get permanently damaged, though I did crack a windshield and shake off my rear view mirror a few times. I still have one of these in my garage...
On a recent road trip, I introduced my kids to Pink Floyd's Dark Side of The Moon. I realized during the intro to Money that the right half of my speakers weren't putting anything out. I hadn't noticed prior to this, since most of the music I listen to now is on the radio and is really just noisy filler for my commute.
Unlike my faulty right door speaker, a pilot would probably notice if the right half of their aircraft wasn't putting out any power. I don't think they would even need any instruments to tell them something is wrong. A few years ago, there was a frequently quoted IoT statistic about Boeing 787s creating a half-terabyte of data per flight. That's about 12,500x the amount of data a plane from 1977 might generate.
After all this data is generated, the results need to be interpreted by the plane's computer, the pilot, and ground crew and acted on, sometimes in real time, sometimes even predictively. Perhaps 85% of the data could be considered noise. That still leaves about 75GB of data to scan for each flight. If it's text data it could be compressed down to under 10GB. Not quite big data anymore if we get rid of the noise.
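For anyone who wants to check the back-of-the-envelope numbers, here is a quick sketch. It assumes "half-terabyte" means 500 GB in decimal units; the 85% noise share and the 10GB text target are the guesses from the paragraph above, not measured values.

```python
# Figures quoted in the post; "half-terabyte" is taken as 500 GB (decimal).
flight_gb = 500
noise_fraction = 0.85      # the post's guess at how much is noise
growth_factor = 12_500     # 787 versus a plane from 1977
text_target_gb = 10        # the post's "under 10GB" after compression

# What is left once the noise is discarded.
signal_gb = flight_gb * (1 - noise_fraction)

# What the 12,500x figure implies for a 1977 flight, in megabytes.
plane_1977_mb = flight_gb * 1000 / growth_factor

# Compression ratio needed to squeeze the signal under the text target.
min_compression = signal_gb / text_target_gb

print(f"signal per flight: {signal_gb:.0f} GB")
print(f"implied 1977 flight: {plane_1977_mb:.0f} MB")
print(f"compression needed: {min_compression:.1f}x")
```

So the 12,500x claim implies roughly 40 MB per flight in 1977, and getting the remaining signal under 10GB needs about 7.5x compression, which is plausible for text.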
I couldn't find a sample of this data, though I did find this report on Noise data for the first 17 months of Boeing 787 operations at Heathrow airport.
According to the study, the Dreamliner is 3-8 dB quieter than similar aircraft. That's about the equivalent of someone breathing, though I guess sustained over time and multiplied by the number of aircraft in flight it could make a difference. The study might have been helped (or hindered) by including some visual and audio samples for reference.
Rather than capturing hundreds of statistics and spending months and countless dollars studying flight patterns, a great gig in the sky, the better metric might have been "is it quieter than the Concorde?"
How can we determine signals from so much noise?
Wednesday, June 24, 2015
Garbage In, Garbage In, Garbage In
Many projects in the Apache ecosystem run Java. One of the places developers spend time in when dealing with performance issues is the Java Virtual Machine's (JVM) Garbage Collection options. When the heap becomes full, garbage is collected.
In the past, I have seen .NET apps that explicitly call the garbage collector improve performance, especially when dealing with black-box code that doesn't dispose of its objects nicely or bloats memory due to poor design. I have also seen it destroy performance for every .NET application on the machine.
In .NET 4.6 RC,
- Enhancements to garbage collection (GC): The GC class now includes TryStartNoGCRegion and EndNoGCRegion methods that allow you to disallow garbage collection during the execution of a critical path. A new overload of the GC.Collect(Int32, GCCollectionMode, Boolean, Boolean) method allows you to control whether both the small object heap and the large object heap are swept and compacted or swept only.
http://stackoverflow.com/questions/118633/whats-so-wrong-about-using-gc-collect
At this point, suppose that performance plays a fundamental role and the slightest alteration in the program's flow could bring catastrophic consequences. Object creation is then reduced to the minimum possible by using object pools and the such but then, the GC chimes in unexpectedly and throws it all away, and someone dies.
Well that got dark really fast, Stack Overflow.
Oracle has a good document around the concepts of the Heap and the Nursery. When the nursery fills up, the older ones leave to public school. When public school fills up, the oldest are forced out into the real world.
https://docs.oracle.com/cd/E13150_01/jrockit_jvm/jrockit/geninfo/diagnos/garbage_collect.html
Databricks, the Spark folks, and Intel, recently posted a great article about how GC works with Spark and how to tune Spark instances for optimized JVM garbage collection which inspired (and augmented some content for) this post.
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
$500 in Google Cloud Credit with Free MapR Hadoop Training
What does MapR get from Google? $110 million in capital financing.
Hadoop / HBase / Drill Training Link here...
https://www.mapr.com/company/press-releases/mapr-collaborates-google-cloud-platform-offer-500-credit-resources-mapr-fre-0
Sandbox VM download here
https://www.mapr.com/products/mapr-sandbox-hadoop/download-sandbox-drill
What do you get with $500 in free Google Cloud credit alongside the MapR training? Apparently quite a bit...
Compute Engine
5 x Servers
- 434.524 total hours per month
- VM class: Regular
- Instance type: n1-highmem-16
- Region: United States
- Total Estimated Cost: $438.00
Persistent Disk
- SSD storage: 0 GB
- Storage: 100 GB
- Snapshot storage: 0 GB
- $4.00
GCE Network Bandwidth
- Egress - Americas/EMEA: 200 GB
- Egress - Asia/Pacific: 0 GB
- Egress - Australia: 0 GB
- Egress - China: 0 GB
- Google Cloud Interconnect United States: 0 GB
- Google Cloud Interconnect Europe: 0 GB
- Google Cloud Interconnect Asia/Pacific: 0 GB
- Egress to a different Zone in the same Region: 0 GB
- Egress to a different Region within the US: 0 GB
- $24.00
Monthly total: $466.00
If you don't want 128GB of RAM and 5 servers in your cluster, you could be a peon and buy some preemptible instances to go the cheaper route.
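A quick sanity check of the estimate, as a sketch only. The dollar amounts and hours are copied from the list above; I am assuming the 434.524 hours is the total across all five servers, as the estimate words it, and the variable names are mine.

```python
# Line items from the estimate above, in US dollars per month.
estimate = {
    "Compute Engine (5 x n1-highmem-16)": 438.00,
    "Persistent Disk (100 GB)": 4.00,
    "Network egress (200 GB)": 24.00,
}
credit = 500.00
total_hours = 434.524   # assumed to be the total across all servers
servers = 5

monthly_total = sum(estimate.values())
remaining = credit - monthly_total

# What the compute line implies per instance-hour and per server.
implied_rate = estimate["Compute Engine (5 x n1-highmem-16)"] / total_hours
hours_per_server = total_hours / servers

print(f"monthly total: ${monthly_total:.2f}, credit left: ${remaining:.2f}")
print(f"implied rate: ${implied_rate:.3f} per instance-hour")
print(f"hours per server: {hours_per_server:.1f}")
```

Under that assumption the credit buys roughly 87 hours on each of the five servers, not a full month of uptime, so shut the cluster down when you are not using it.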
Wednesday, May 6, 2015
Elements of Scale
Amazing, comprehensive article around relational, NoSQL, and many other approaches to reading and writing information.
http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/
If a relational database can't solve a specific problem efficiently and timely, perhaps throwing the kitchen sink, or data platform at it could...