Examples from Hadoop, Pig, HBase, Flink and Spark
The goal of next-generation open-source Big Data frameworks such as Spark and Flink is to match the performance and user-friendly expressiveness of Massively Parallel Processing (MPP) databases such as Netezza. SQL interfaces such as Hive and SparkSQL provide the expressiveness. However, matching the performance is proving to be a much harder . . .
This article explores the question: can Java's polymorphic behavior be used for the Mapper/Reducer input/output key/value classes?
The intuitive answer is yes. However, there are caveats, and understanding them sheds light on how the MapReduce framework operates internally.
Rules to follow when using polymorphic . . .
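The core caveat can be demonstrated without Hadoop at all. The framework instantiates the *declared* key/value class (by reflection, from the job configuration) and deserializes each record into that one instance, so if a mapper emits a subclass, the extra fields never reach the reducer. The sketch below mimics that pattern with a stand-in for the Writable contract; the interface and class names are illustrative, not Hadoop's real API:

```java
import java.io.*;

// A minimal stand-in for Hadoop's Writable contract (illustrative only,
// not the real org.apache.hadoop.io.Writable interface).
interface MyWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

class BaseValue implements MyWritable {
    int a;
    public void write(DataOutput out) throws IOException { out.writeInt(a); }
    public void readFields(DataInput in) throws IOException { a = in.readInt(); }
}

class SubValue extends BaseValue {
    int b;
    @Override public void write(DataOutput out) throws IOException {
        super.write(out);
        out.writeInt(b);
    }
    @Override public void readFields(DataInput in) throws IOException {
        super.readFields(in);
        b = in.readInt();
    }
}

public class PolymorphismCaveat {
    // Returns {value of a seen by the "reducer", number of unread bytes}.
    static int[] roundTrip() throws IOException {
        // The mapper emits a subclass instance...
        SubValue emitted = new SubValue();
        emitted.a = 1;
        emitted.b = 2;
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        emitted.write(new DataOutputStream(buf));

        // ...but the framework deserializes into an instance of the
        // *declared* value class, so the reducer sees a BaseValue and the
        // subclass's bytes are silently left unread in the stream.
        BaseValue received = new BaseValue();
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        received.readFields(in);
        return new int[] { received.a, in.available() };
    }

    public static void main(String[] args) throws IOException {
        int[] r = roundTrip();
        System.out.println("a = " + r[0] + ", unread bytes = " + r[1]);
    }
}
```

The four unread bytes are the subclass field `b`: the data was written but there is no instance of the right type to read it into.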
Namespaces, introduced in version 0.96, are a very important feature
One challenge faced by developers working with pre-0.96 versions of HBase is the inability to create user-based environments in HBase for development and testing. This is similar to the concept of a schema in Oracle. When using an RDBMS we are used to each user getting their own test database (a schema in Oracle). A developer can develop using . . .
for shuffle transformations (e.g., reduceByKey)
The previous article explored how input partitions are defined by Spark. This short article describes how partitions are defined when Spark needs to shuffle data.
Transformations which require Data Shuffling
Some transformations require data to be shuffled. Examples of such transformations in Spark are:
3. . . .
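What a shuffle actually decides is which partition each key lands in. Spark's default HashPartitioner routes a key via a non-negative modulus of its hashCode, so all records with the same key meet in one reducer-side partition. The plain-Java sketch below mirrors that rule; it is an illustration, not Spark code:

```java
import java.util.*;

// Plain-Java sketch of how Spark's default HashPartitioner assigns a key
// to a reducer-side partition during a shuffle (e.g., reduceByKey).
public class HashPartitionDemo {
    // Mirrors Spark's non-negative modulus: hashCode can be negative in Java.
    static int nonNegativeMod(int x, int mod) {
        int r = x % mod;
        return r < 0 ? r + mod : r;
    }

    static int partitionFor(Object key, int numPartitions) {
        return nonNegativeMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        // Equal keys always map to the same partition, so they are
        // guaranteed to be processed together by one reducer task.
        Map<Integer, List<String>> partitions = new TreeMap<>();
        for (String word : Arrays.asList("spark", "flink", "spark", "hbase")) {
            partitions
                .computeIfAbsent(partitionFor(word, numPartitions),
                                 p -> new ArrayList<>())
                .add(word);
        }
        System.out.println(partitions);
    }
}
```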
by controlling the input partitions
It is useful to be able to control the degree of parallelism when using Spark. Spark provides convenient methods to increase the degree of parallelism, which should be adequate in practice. This blog entry of the "Inside Spark" series describes the knobs Spark uses to control the degree of parallelism.
Controlling the number of . . .
Inside Spark Series
To test Big Data programs effectively you need smaller datasets that are representative of the large dataset. In this chapter we will explore how to create such a dataset . . .
I am currently learning to use Spark. It seems like an elegant substitute for Hadoop. The syntax is cleaner and it provides better abstractions. Plus it is considerably more flexible, as my distributed components can be organized in a DAG instead of chained MapReduce jobs. I discussed this in my earlier blog article on the MapReduce Itch.
. . .