Tuesday 27 March 2012

A Glance at the Hadoop Failure Model

Hadoop is designed to be a fault-tolerant system. Jobs should be resilient to nodes going down and other random failures. Hadoop isn’t perfect, however; I still see jobs fail due to random causes every now and again. I decided to investigate how much the different factors contribute to a job failing.
A Hadoop job fails if the same task fails a predetermined number of times (by default, four). This limit is set through the properties “mapred.map.max.attempts” and “mapred.reduce.max.attempts”. For a job to fail randomly, an individual task needs to fail randomly that many times. A task can fail randomly for a variety of reasons: a few we’ve seen are disks filling up, assorted bugs in Hadoop, and hardware failures.
The formula for the probability of a job failing randomly can be derived as follows:


Pr[individual task failing maximum #times] = Pr[task failing] ^ (max task failures)
Pr[task succeeding] = 1 - Pr[individual task failing maximum #times]
Pr[job succeeding] = Pr[task succeeding] ^ (num tasks)
Pr[job failing] = 1 - Pr[job succeeding]

Pr[job failing] = 1 - (1-Pr[task failing] ^ (max task failures))^(num tasks)
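
This formula is easy to play with directly. Here is a minimal Python sketch of it (the function name and argument names are just illustrative):

def pr_job_failing(p_task_fail, max_task_failures, num_tasks):
    # Probability that a single task fails on every one of its allowed attempts.
    pr_task_exhausts_attempts = p_task_fail ** max_task_failures
    # The job succeeds only if every task succeeds within its allowed attempts.
    pr_job_succeeding = (1 - pr_task_exhausts_attempts) ** num_tasks
    return 1 - pr_job_succeeding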

As mentioned above, the maximum number of task failures is controlled by the “mapred.map.max.attempts” and “mapred.reduce.max.attempts” properties and defaults to 4.
Let’s take a significant workload of 100,000 map tasks and see what the numbers look like:
[Figure: mapper failures – probability of job failure vs. task failure rate for 100,000 map tasks, at varying values of “max task failures”]
As the probability of a task failing rises above 1%, the probability of the job failing increases rapidly. It is very important to keep the cluster stable and the failure rate relatively small, as these numbers show that Hadoop’s failure model only goes so far. We can also see the importance of the “max task failures” parameter: values under 4 push the probability of job failure to significant levels even with a 0.5% chance of task failure.
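Plugging a few task failure rates into the Python sketch above makes the jump concrete (values are approximate):

pr_job_failing(0.005, 4, 100000)   # ~0.006%
pr_job_failing(0.005, 3, 100000)   # ~1.2% (dropping max attempts to 3 already hurts)
pr_job_failing(0.01,  4, 100000)   # ~0.1%
pr_job_failing(0.05,  4, 100000)   # ~46%, close to a coin flip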
Reducers run for a much longer period of time than mappers, which means a reducer has more time for a random event to cause it to fail. The probability of a reducer failing is therefore much higher than that of a mapper. This is balanced out by the fact that there are far fewer reducers. Let’s look at some numbers more representative of a job failing due to reducers failing:
[Figure: reducer failures – probability of job failure vs. reducer failure rate]
The probability of a reducer failing needs to climb to around 10% before the job has a significant chance of failing.
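To get a feel for why, here is the same sketch applied to a hypothetical 1,000-reducer job (the reducer count is purely an assumption for illustration, not a figure from the charts above):

pr_job_failing(0.05, 4, 1000)   # ~0.6%
pr_job_failing(0.10, 4, 1000)   # ~9.5%
pr_job_failing(0.20, 4, 1000)   # ~80%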

Bad Nodes

One more variable to consider in the model is bad nodes. Oftentimes a node goes bad and every task run on it fails, whether because of a disk going bad, a disk filling up, or other causes. With a bad node, you typically see a handful of mappers and reducers fail before the node gets blacklisted and no more tasks are assigned to it. To simplify the analysis, let’s assume that each bad node causes a fixed number of tasks to fail. Additionally, let’s assume a task can only be affected by a bad node once, which is reasonable because nodes are blacklisted fairly quickly. Call the tasks that fail once due to a bad node “b-tasks” and the other tasks “n-tasks”. A “b-task” starts with one failure on record, so it only needs to fail randomly “max task failures - 1” more times to cause the job to fail. On our cluster we typically see a bad node cause three tasks to fail automatically, so using that number the modified formula ends up looking like:

#b-tasks = #bad nodes * 3
Pr[all b-tasks succeeding] = (1-Pr[task failing] ^ (max task failures - 1))^(#b-tasks)
Pr[all n-tasks succeeding] = (1-Pr[task failing] ^ (max task failures))^(num tasks - #b-tasks)
Pr[job succeeding] = Pr[all b-tasks succeeding] * Pr[all n-tasks succeeding]
Pr[job succeeding] = (1-Pr[task failing] ^ (max task failures - 1))^(#b-tasks) * (1-Pr[task failing] ^ (max task failures))^(num tasks - #b-tasks)
Pr[job failing] = 1 - Pr[job succeeding]

Pr[job failing] = 1 - (1-Pr[task failing] ^ (max task failures - 1))^(#b-tasks) * (1-Pr[task failing] ^ (max task failures))^(num tasks - #b-tasks)
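
The modified formula translates into a small tweak of the earlier Python sketch (again just an illustrative sketch; three tasks per bad node is the figure quoted above):

def pr_job_failing_with_bad_nodes(p_task_fail, max_task_failures, num_tasks,
                                  num_bad_nodes, tasks_per_bad_node=3):
    # Tasks hit by a bad node ("b-tasks") already have one failure on record,
    # so they only need (max task failures - 1) further random failures.
    num_b_tasks = num_bad_nodes * tasks_per_bad_node
    pr_b_tasks_ok = (1 - p_task_fail ** (max_task_failures - 1)) ** num_b_tasks
    pr_n_tasks_ok = (1 - p_task_fail ** max_task_failures) ** (num_tasks - num_b_tasks)
    return 1 - pr_b_tasks_ok * pr_n_tasks_ok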

Since there are so many mappers, the results of the formula barely change for a handful of bad nodes. Given that the number of reducers is relatively small, though, the numbers do change somewhat:
[Figure: bad nodes – probability of job failure vs. number of bad nodes]
Happily, the numbers aren’t too drastic: five bad nodes cause the failure rate to increase by 1.5x to 2x.
In the end, Hadoop is fairly fault tolerant as long as the probability of a task failing is kept relatively low. Based on the numbers we’ve looked at, 4 is a good value for “max task failures”, and you should start worrying about cluster stability when the task failure rate approaches 1%. You could always raise the “max task failures” properties for extra robustness, but if you are having that many failures you will be suffering performance penalties and would be better off making your cluster more stable.
