Evaluating Hadoop Clusters with TPCx-HS
Technical Report No. 2015-1
September 11, 2015
Todor Ivanov and Sead Izberovic
The growing complexity and variety of Big Data platforms make it both difficult and time-consuming for system users to properly set up and operate the systems. Another challenge is comparing the platforms in order to choose the most appropriate one for a particular application. All these factors motivate the need for a standardized Big Data benchmark that can help users in the process of platform evaluation. TPCx-HS was recently released as the first standardized Big Data benchmark designed to stress test a Hadoop cluster. The goal of this study is to evaluate and compare how the network setup influences the performance of a Hadoop cluster. In particular, experiments were performed using shared and dedicated 1Gbit networks with the same Cloudera Hadoop Distribution (CDH) cluster setup. The TPCx-HS benchmark, which is very network intensive, was used to stress test and compare both cluster setups. All presented results were obtained using the officially available version of the benchmark, but they are not comparable with officially reported results; they are meant as an experimental evaluation and have not been audited by any external organization. As expected, the dedicated 1Gbit network setup performed much faster than the shared 1Gbit setup. What was surprising, however, is the negligible price difference between the two cluster setups, which pays off with a multifold performance return.
The rest of the report is structured as follows: Section 2 provides a brief description of the technologies involved in our study. An overview of the hardware and software setup used for the experiments is given in Section 3. A brief summary of the TPCx-HS benchmark is presented in Section 4. The performed experiments, together with the evaluation of the results, are presented in Section 5. Finally, Section 6 concludes with lessons learned.