FREMONT, CA: LinkedIn has open-sourced Dr. Elephant, a performance monitoring and tuning tool that helps Hadoop and Spark users understand, analyze, and improve their workflows.
Dr. Elephant is a performance monitoring and tuning tool for Hadoop and Spark that automatically gathers all the metrics for a job, runs analysis on them, and presents the results in a simple way for easy consumption. The goal of the tool is to improve developer productivity and increase cluster efficiency by making jobs easier to tune. It analyzes Hadoop and Spark jobs using a set of pluggable, configurable, rule-based heuristics that provide insights into how a job performed, and then uses the results to suggest how to tune the job so that it runs more efficiently.
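To make the idea of pluggable, configurable, rule-based heuristics concrete, here is a minimal Python sketch. It is not Dr. Elephant's actual code: the registry, the heuristic names (`mapper_spill`, `gc_overhead`), the metric keys, and the thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class HeuristicResult:
    name: str
    value: float      # the ratio the rule measured
    suggestion: str   # empty when the rule has nothing to recommend


# A heuristic takes a job's metrics plus its own rule parameters.
Heuristic = Callable[[Dict[str, float], Dict[str, float]], HeuristicResult]

# Hypothetical plug-in registry: new rules register themselves by name.
REGISTRY: Dict[str, Heuristic] = {}


def heuristic(name: str) -> Callable[[Heuristic], Heuristic]:
    def register(fn: Heuristic) -> Heuristic:
        REGISTRY[name] = fn
        return fn
    return register


@heuristic("mapper_spill")
def mapper_spill(metrics: Dict[str, float], params: Dict[str, float]) -> HeuristicResult:
    # Ratio of spilled records to map output records (illustrative metric keys).
    ratio = metrics["spilled_records"] / max(metrics["map_output_records"], 1)
    tip = ""
    if ratio > params["max_spill_ratio"]:
        tip = "Consider a larger map-side sort buffer."
    return HeuristicResult("mapper_spill", ratio, tip)


@heuristic("gc_overhead")
def gc_overhead(metrics: Dict[str, float], params: Dict[str, float]) -> HeuristicResult:
    # Share of CPU time spent in garbage collection (illustrative metric keys).
    ratio = metrics["gc_time_ms"] / max(metrics["cpu_time_ms"], 1)
    tip = ""
    if ratio > params["max_gc_ratio"]:
        tip = "Consider tuning heap size or GC settings."
    return HeuristicResult("gc_overhead", ratio, tip)


def analyze(metrics: Dict[str, float],
            config: Dict[str, Dict[str, float]]) -> List[HeuristicResult]:
    # The config decides which rules run and with which parameters.
    return [REGISTRY[name](metrics, params) for name, params in config.items()]


if __name__ == "__main__":
    job_metrics = {"spilled_records": 300, "map_output_records": 100,
                   "gc_time_ms": 10, "cpu_time_ms": 1000}
    rule_config = {"mapper_spill": {"max_spill_ratio": 2.0},
                   "gc_overhead": {"max_gc_ratio": 0.05}}
    for result in analyze(job_metrics, rule_config):
        print(result.name, result.value, result.suggestion or "no change suggested")
```

In this sketch, adding a rule means writing one decorated function, and changing a threshold means editing the configuration rather than the code.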
LinkedIn's employees have different levels of Hadoop experience and use different frameworks to run their Hadoop jobs. As the number of Hadoop users grew, holding regular sessions for different users on distinct frameworks stopped working. LinkedIn could not verify that each job achieved optimal performance, nor could it guarantee performance coverage, which is why it needed to standardize and automate the process.
Hadoop is an open-source software framework for the distributed storage and processing of large datasets, and it consists of a number of components that interact with each other. Apache Spark is a fast engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.
How Dr. Elephant works
At regular intervals, Dr. Elephant gets a list of all recently succeeded and failed applications from the YARN resource manager. The metadata for each application (the job counters, configurations, and task data) is fetched from the Job History server. Once it has all the metadata, Dr. Elephant runs a set of heuristics on it and generates a diagnostic report on how the individual heuristics and the job as a whole performed. Each result is then tagged with one of five severity levels to indicate potential performance problems.
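The tagging step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Dr. Elephant's implementation: the article says only that there are five severity levels, so the level names, the four ascending cut-offs per heuristic, and the choice to rate the whole job by its worst heuristic are all assumptions made here, as are the metric names.

```python
from bisect import bisect_right
from enum import IntEnum
from typing import Dict, List


class Severity(IntEnum):
    # Five levels, as the article states; the names are illustrative.
    NONE = 0
    LOW = 1
    MODERATE = 2
    SEVERE = 3
    CRITICAL = 4


def severity_for(value: float, thresholds: List[float]) -> Severity:
    # thresholds holds four ascending cut-offs (LOW, MODERATE, SEVERE, CRITICAL).
    # A value below the first cut-off maps to NONE; reaching a cut-off
    # moves the result up one level.
    return Severity(bisect_right(thresholds, value))


def diagnose(metrics: Dict[str, float],
             rules: Dict[str, List[float]]) -> Dict[str, object]:
    # Tag every heuristic, then rate the job as a whole by its worst finding
    # (an assumption made for this sketch).
    per_heuristic = {name: severity_for(metrics[name], cuts)
                     for name, cuts in rules.items()}
    overall = max(per_heuristic.values(), default=Severity.NONE)
    return {"heuristics": per_heuristic, "job": overall}


if __name__ == "__main__":
    # Hypothetical metric names and cut-offs.
    rules = {"spill_ratio": [1.0, 2.0, 3.0, 4.0],
             "gc_ratio": [0.01, 0.02, 0.05, 0.10]}
    report = diagnose({"spill_ratio": 2.5, "gc_ratio": 0.005}, rules)
    for name, level in report["heuristics"].items():
        print(name, level.name)
    print("job", report["job"].name)
```

Using an ordered enum keeps comparison and aggregation trivial: the job-level rating is simply the maximum of the per-heuristic ratings.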
LinkedIn uses Dr. Elephant for many different use cases, including monitoring how a flow is performing on the cluster, understanding why a flow is running slowly, identifying what can be tuned to improve a flow and how, comparing a flow against its previous executions, and troubleshooting.
Apart from adding and improving heuristics and extending the tool to newer job types, LinkedIn plans to add job-specific tuning suggestions based on real-time metrics, visualizations of jobs' cluster resource usage and trends, better Spark integration, and integration with more schedulers.