I am currently learning to use Spark. It seems like an elegant substitute for Hadoop: the syntax is cleaner and it provides better abstractions. It is also considerably more flexible, since my distributed components can be organized in a DAG instead of chained MapReduce jobs (a sketch of the difference follows below). I discussed this in my earlier blog article on the MapReduce Itch.
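To make the DAG point concrete, here is a minimal sketch against Spark's Java API. The `events.log` path, the comma-separated key format, and an already-created `JavaSparkContext sc` are all assumptions for illustration. Three logical steps that would require two or three chained MapReduce jobs in Hadoop become a single pipeline that Spark schedules as one DAG:

```java
// Assumes an existing JavaSparkContext sc and a hypothetical events.log
// whose lines start with a comma-separated key.
JavaRDD<String> lines = sc.textFile("events.log");

JavaPairRDD<String, Integer> counts = lines
        .filter(line -> !line.isEmpty())                         // cleanse
        .mapToPair(line -> new Tuple2<>(line.split(",")[0], 1))  // extract key
        .reduceByKey((a, b) -> a + b);                           // aggregate

// Rank by count: in classic MapReduce this "top N" step alone
// would typically be a second chained job.
List<Tuple2<String, Integer>> top10 = counts
        .mapToPair(Tuple2::swap)
        .sortByKey(false)
        .mapToPair(Tuple2::swap)
        .take(10);
```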
Currently I intend to go through two major books on Spark:
- Learning Spark - This book seems to handle the basics extremely well; based on the first four chapters I have read, it seems like an excellent choice for a first book on Spark.
- Advanced Analytics with Spark - This one looks more advanced and tackles a topic I have been planning to revisit for over a year: Data Science and Statistical Data Mining. It has been slightly over 4 years since I finished my Masters program in Applied Mathematics, and Spark seems like an excellent way to put my Big Data expertise to use while exploring this exciting field.
I am checking in all the code I write in this endeavor to the following GitHub repository.
Spark with Java 8 or Scala?
Currently there is a major debate going on in the Spark community about which language is better for Spark development: Scala or Java? Popular opinion (at least on the online forums) seems biased in favor of Scala. One reason often touted is that Spark is written in Scala, hence Spark projects should be done in Scala, and that using Java presents the so-called "Impedance Mismatch". This is where I strongly disagree!!
What is Impedance Mismatch?
Personally I don't like the term at all. It has nothing to do with the engineering definition of "impedance", but then again it has caught on, so I will play along and offer two real-world examples of "Impedance Mismatch":
- Using Python to develop MapReduce programs on top of the Hadoop framework
- Using Shell Scripts to develop MapReduce programs on top of the Hadoop framework
Both of the above examples use Hadoop Streaming to achieve their goals. Why are they examples of "Impedance Mismatch"? Because you need to cross process boundaries when using Python or Shell Scripts with Hadoop. It is terribly inconvenient, renders most of the Hadoop API inaccessible, and is good enough for only the most basic tasks. You certainly won't add Hadoop to your Enterprise Architecture if you are only going to use Python and Shell Scripts to develop MapReduce jobs.
Why is Java for Spark not an example of Impedance Mismatch?
Because Scala programs run in the JVM. Java libraries invoked from a Scala program run in the same JVM as the Scala program invoking them. The opposite is also true: Scala libraries invoked from a Java program execute in the same JVM. So where is the "Impedance Mismatch"?
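As a trivial demonstration, here is a minimal sketch of Java code using a Scala class directly: scala.Tuple2, which Spark's own Java API hands back to you. The class name `JvmInterop` is mine; the only requirement is the scala-library jar on the classpath.

```java
import scala.Tuple2; // a class written in Scala, used directly from Java

public class JvmInterop {
    public static void main(String[] args) {
        // A Scala class compiles to an ordinary JVM class: instantiating
        // it from Java involves no serialization layer or extra process.
        Tuple2<String, Integer> pair = new Tuple2<>("spark", 1);
        System.out.println(pair._1() + " -> " + pair._2());
    }
}
```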
The real reason people prefer Scala is that it provides functional capabilities which Java (up to version 7) does not. Functional programs written in Java 7 and below are ugly to look at and inconvenient to maintain. But just inconvenient, mind you, not unmaintainable, as the popularity of the Google Guava API has demonstrated for over 5 years.
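For a flavor of that inconvenience, here is a minimal sketch using Guava (the word-length transformation and the class name are hypothetical, picked only for illustration). The logic is a single expression, yet it drowns in anonymous-class ceremony:

```java
import java.util.Arrays;
import java.util.List;

import com.google.common.base.Function;
import com.google.common.collect.Lists;

public class Java7Functional {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "hadoop", "scala");

        // The transformation itself is one line of logic, but it is
        // buried under anonymous-class boilerplate.
        List<Integer> lengths = Lists.transform(words,
                new Function<String, Integer>() {
                    @Override
                    public Integer apply(String word) {
                        return word.length();
                    }
                });

        System.out.println(lengths); // prints [5, 6, 5]
    }
}
```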
Enter Java 8 - Functional Programming in Java
But Java 8 has changed all that. You can now write truly functional programs in Java 8, so the strongest argument for Scala over Java on Spark projects is considerably weakened. Also, Java 7 seems like it will reach its end of life in April 2015, so you will be on Java 8 very soon anyway. Why would you select another language to implement your Spark projects?
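Here is the same flavor of code with Java 8 lambdas, as a minimal word count sketch against Spark's Java API. This assumes Spark 1.x, where flatMap expects an Iterable (Spark 2.x expects an Iterator); the input and output paths are hypothetical.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class Java8WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Java8WordCount")
                                        .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("input.txt"); // hypothetical path

        // Each step is a lambda; no anonymous-class boilerplate in sight.
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")))
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile("counts"); // hypothetical output directory
        sc.stop();
    }
}
```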
Sticking with Java 8 seems more prudent
Going to Scala may feel cool, but it introduces risks which, in my opinion, do not come with compensating benefits. It will be harder to find experienced Scala programmers; most will be recently retrained Java programmers who are still stumbling around the language.
Distributed computing projects based on Hadoop and Spark are inherently risky and time-consuming. If you read the online articles you might feel Spark will suddenly make distributed programming trivial, but when I think of my projects in the context of Spark, while the code will look cleaner and more concise, the key design challenges will remain. In fact, more abstractions mean more decisions to make to squeeze out the maximum performance. Anyone who has programmed in Pig or Hive will know what I am talking about.
You will need a deeper understanding of how the underlying Spark framework operates!! See this excellent talk by Aaron Davidson for an illustration of what I mean by a "deeper understanding of how the underlying framework operates"; one classic example of such a decision is sketched below.
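As a hypothetical illustration of the kind of decision the abstractions hide, assume `pairs` is a `JavaPairRDD<String, Integer>` of (word, 1) tuples, like the output of the `mapToPair` step in the word count sketch above. Here are two ways to compute identical per-word counts that behave very differently at scale:

```java
// groupByKey ships every (word, 1) record across the network
// before anything is summed: a full shuffle of the raw data.
JavaPairRDD<String, Integer> viaGroup = pairs
        .groupByKey()
        .mapValues(values -> {
            int sum = 0;
            for (int v : values) sum += v;
            return sum;
        });

// reduceByKey combines values map-side first, so only one partial
// sum per key per partition crosses the network.
JavaPairRDD<String, Integer> viaReduce = pairs.reduceByKey((a, b) -> a + b);
```

The API makes both look equally innocent; knowing that reduceByKey performs map-side combining while groupByKey shuffles every raw record is exactly the kind of framework knowledge the talk is about.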
But to gain this deeper understanding you will still need to learn Scala!!
Why do you still need to learn Scala?
Even though I prefer Java 8 for developing my own Spark applications, it is important to be able to read Scala code. The hardest problems in Hadoop are solved by peeking into the Hadoop source code; it is an essential skill, and it will be just as true in the Spark world. You will need to peek into the Spark source to write high-performance Spark applications, and Spark is written in Scala. However, reading programs in a relatively unfamiliar language is easier than writing programs in one.
This position is not motivated by laziness. I am learning Scala to ensure I am comfortable reading Scala programs. But if I were starting a Spark project tomorrow, I would stick with Java: Java 8 if possible, but if not, I would still stick with Java (7 and below).
I am aware that not all Spark libraries support Java; for example, the GraphX library does not have a Java API. If I need to use GraphX, I will write those components in Scala, but I will write the rest in Java 8.
My GitHub Project - Spark Using Java 8
I intend to explore Spark in all its gory detail in my GitHub project. The goal is not to collect simple word-count-type toy applications; I intend to write some complex Spark code in Java 8. I want to determine for myself whether Java 8 is an adequate platform and whether everything I said above, which is based on inferences drawn from basic programs, holds true when we throw complex problems at Spark.
Configuration for Eclipse on Windows
If you are like me, you like using Eclipse on Windows 7 for your development. If you import my project and attempt to run the WordCount program, you will get an error message indicating that "winutils.exe" was not found. To fix this, perform the following steps (I found this fix while exploring the following StackOverflow link):
- Download the contents of the following link and copy them to your `%HADOOP_HOME%/bin` folder.
- If you cannot be bothered to configure your `%HADOOP_HOME%` environment variable, then simply copy them to the subfolder `bin` inside any folder of your choice. For example, your path containing the files in that link might be `c:/MyHadoop/bin`. You will know it is right if `c:/MyHadoop/bin` contains a file called `winutils.exe`.
- Configure the System property `hadoop.home.dir` to contain the value `c:/MyHadoop/`. Be careful not to include the `bin` sub-folder when you set that system property. There are two ways to do that: pass the `-D` option `-Dhadoop.home.dir=c:/MyHadoop/` on the command line, or at the beginning of your Java `main` function invoke the following line of code:
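```java
// c:/MyHadoop/ is the example path from above; substitute the folder
// that contains your bin sub-folder (not the bin folder itself).
System.setProperty("hadoop.home.dir", "c:/MyHadoop/");
```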
Your Spark environment is now configured in Eclipse on Windows. The rest of the configuration is handled through the pom file in my GitHub project.
By Sameer Wadkar - Big Data Architect/Programmer, Author and Open Source Contributor