Big Data Performance Engineering

Examples from Hadoop, Pig, HBase, Flink and Spark

The goal of next-generation open-source Big Data frameworks such as Spark and Flink is to match the performance and user-friendly expressiveness of Massively Parallel Processing (MPP) databases such as Netezza. SQL interfaces such as Hive and SparkSQL provide the expressiveness. However, matching the performance is proving to be a much harder . . .

Read More

August 16, 2015

Polymorphism and MapReduce

This article explores the question: can Java's polymorphic behavior be used for the Mapper/Reducer input/output key/value classes?

The intuitive answer should be yes. However, there are caveats, and understanding them helps explain how the MapReduce framework operates internally.
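The core caveat can be illustrated without Hadoop itself. Hadoop's Writable serialization records only field data (no class name), and on the receiving side the framework instantiates the key/value classes declared in the job configuration rather than the concrete class the mapper emitted. The following plain-Python sketch (class and function names are hypothetical, not Hadoop APIs) simulates that round trip:

```python
# Plain-Python simulation (not Hadoop code) of the Writable round trip.
# Hadoop serializes only field data -- no class name -- and, after the
# shuffle, instantiates the class *declared in the job configuration*,
# not the concrete class the mapper actually emitted.

class BaseWritable:
    def __init__(self, value=0):
        self.value = value

    def write(self):                      # serialize: field data only
        return {"value": self.value}

    def read_fields(self, data):          # deserialize into this instance
        self.value = data["value"]

class SubWritable(BaseWritable):
    def __init__(self, value=0, extra=""):
        super().__init__(value)
        self.extra = extra

    def write(self):
        d = super().write()
        d["extra"] = self.extra
        return d

    def read_fields(self, data):
        super().read_fields(data)
        self.extra = data["extra"]

def shuffle_round_trip(obj, declared_class):
    """Mimic the shuffle: serialize obj, then deserialize into a fresh
    instance of the declared class, exactly as the framework does."""
    data = obj.write()
    fresh = declared_class()              # framework ignores type(obj)
    fresh.read_fields(data)
    return fresh

emitted = SubWritable(42, extra="will be dropped")
received = shuffle_round_trip(emitted, BaseWritable)  # job declares BaseWritable
print(type(received).__name__)  # BaseWritable -- subclass identity is gone
```

The subclass serializes fine, but what comes out of the shuffle is an instance of the declared base class: the polymorphic behavior (and any subclass-only fields) is silently lost unless the declared classes are chosen carefully.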

Rules to follow when using polymorphic . . .

Read More

April 18, 2015

HBase Series: Multi-Tenancy

Namespaces, introduced in version 0.96, are a very important feature

One challenge faced by developers working with pre-0.96 versions of HBase was the inability to create user-based environments in HBase for development and testing. This is similar to the concept of a Schema in Oracle: when using an RDBMS, we are used to each user getting his or her own test database (a schema in Oracle). A developer can develop using . . .

Read More

March 29, 2015

Controlling the number of Partitions in Spark

for shuffle transformations (e.g., reduceByKey)

The previous article explored how input partitions are defined by Spark. This short article describes how partitions are defined when Spark needs to shuffle data.

Transformations which require Data Shuffling

Some transformations require data to be shuffled. Examples of such transformations in Spark are:
1. groupByKey
2. reduceByKey
3. . . .
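During a shuffle, Spark's default HashPartitioner decides which output partition each key lands in: the key's hash modulo the number of partitions. The plain-Python sketch below (the `hash_partition` helper is hypothetical, not Spark source) shows why every pair sharing a key ends up in the same partition, which is what lets a transformation like reduceByKey merge values without a second shuffle:

```python
# Plain-Python sketch (not Spark source) of HashPartitioner's routing:
# partition index = hash(key) mod numPartitions. In Python, % with a
# positive modulus already yields a non-negative result.

def hash_partition(key, num_partitions):
    return hash(key) % num_partitions

pairs = [("a", 1), ("b", 1), ("a", 2), ("c", 1)]
num_partitions = 4

buckets = {}
for key, value in pairs:
    buckets.setdefault(hash_partition(key, num_partitions), []).append((key, value))

# Every pair with the same key lands in the same bucket, so each
# reducer task can merge its keys' values locally.
```

Shuffle transformations such as reduceByKey also accept an explicit partition count (e.g., `reduceByKey(func, numPartitions)`), which is the main knob for controlling parallelism on the shuffle side.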

Read More

March 01, 2015

Controlling Parallelism in Spark

by controlling the input partitions

It is useful to be able to control the degree of parallelism when using Spark. Spark provides a very convenient method to increase the degree of parallelism, which should be adequate in practice. This blog entry of the "Inside Spark" series describes the knobs Spark uses to control the degree of parallelism.
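For file-based RDDs, the number of input partitions comes from Hadoop's input-split arithmetic: `sc.textFile(path, minPartitions)` passes the hint down as the desired number of splits, and FileInputFormat clamps the resulting split size between a configured minimum and the block size. The plain-Python sketch below (the `compute_split_size` name is hypothetical) reproduces that arithmetic:

```python
# Plain-Python sketch of Hadoop FileInputFormat's split-size formula,
# which Spark's sc.textFile(path, minPartitions) feeds with
# minPartitions as the numSplits hint:
#   goalSize  = totalSize / numSplits
#   splitSize = max(minSize, min(goalSize, blockSize))

def compute_split_size(total_size, num_splits, min_size, block_size):
    goal_size = total_size // max(num_splits, 1)
    return max(min_size, min(goal_size, block_size))

# A 1 GB file with 128 MB blocks: asking for 16 partitions shrinks the
# split below a full block, so more (smaller) input partitions result.
MB = 1024 * 1024
split = compute_split_size(total_size=1024 * MB, num_splits=16,
                           min_size=1, block_size=128 * MB)
print(split // MB)  # 64 -> roughly 16 splits of 64 MB each
```

Note the asymmetry: a larger `minPartitions` can raise parallelism by shrinking splits, but the block size caps the split size, so requesting fewer partitions than the block count has no effect.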

Controlling the number of . . .

Read More

February 21, 2015

Sampling Large Datasets using Spark

Inside Spark Series

Unit testing is proving to be essential in developing Big Data programs. An entire chapter (Chapter 8) is devoted to this subject in my book Pro Apache Hadoop.

To test Big Data programs effectively you need smaller datasets that are representative of the large dataset. In this article we will explore how to create such a dataset . . .
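Spark's `RDD.sample(withReplacement=False, fraction)` performs Bernoulli sampling: each element is kept independently with probability `fraction`, so no global pass or sort is needed and the sample stays representative. The plain-Python sketch below (the `bernoulli_sample` name is hypothetical, not the Spark API) shows the idea:

```python
import random

# Plain-Python sketch of the Bernoulli sampling behind Spark's
# RDD.sample(withReplacement=False, fraction): each element is kept
# independently with probability `fraction`, so the sample size is
# only *approximately* fraction * n, varying from run to run.

def bernoulli_sample(data, fraction, seed=None):
    rng = random.Random(seed)
    return [x for x in data if rng.random() < fraction]

data = list(range(100_000))
sample = bernoulli_sample(data, fraction=0.01, seed=42)
# roughly 1,000 elements, each drawn uniformly from the full dataset
```

Because the decision is per-element, this parallelizes trivially across partitions; fixing the seed makes the smaller test dataset reproducible across runs.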

Read More

February 09, 2015

Implementing Spark Projects using Java 8

I am currently learning to use Spark. It seems like an elegant substitute for Hadoop. The syntax is cleaner and it provides better abstractions. Plus, it is considerably more flexible, as my distributed components can be organized in a DAG instead of a chain of MapReduce jobs. I discussed this in my earlier blog article on the MapReduce Itch.

. . .

Read More

February 08, 2015