Check Lists: The Low Hanging Fruit of Process Safety

Dan Cojocaru – akh.deocamdata.ro

Angus Keddie – Process Safety Matters

One of the remaining ‘low hanging fruit’ in the realm of Process Safety is the rigorous application of appropriate checklists. In his book ‘The Checklist Manifesto’, Atul Gawande argues that preparing, and adhering to, checklists for hazardous activities is a low-cost, low-effort route to reducing incidents. Working in the health sector, he discovered numerous examples of checklist success, including the prevention of 8 deaths and 42 infections over 27 months through the adoption of a 5-point checklist in the Intensive Care Unit of a prestigious American hospital.

Introduction

There are powerful examples of the use of checklists in aviation, medicine and high-rise construction. This article will explore how we can apply this principle to High Hazard Process Plants by:

  • Examining the historical relationship between checklists, process safety incidents and their prevention.
  • Demonstrating a correlation, and postulating causation, between appropriate use of checklists and a reduction in process safety incidents.
  • Describing best practices in the use of process safety checklists.
  • Reviewing the evolution of checklists in related industries to determine which elements may be transferable.
  • Contrasting and comparing current checklist tools.

An accident driving the checklist idea

On October 30, 1935 the US Army Air Corps held a flight competition for airplane manufacturers vying to build the military’s next-generation long-range bombers.

A small crowd of army brass and manufacturing executives watched as Boeing’s Model 299 test plane, dubbed the “flying fortress”, taxied onto the runway. The competitors were much smaller airplanes designed by Martin and Douglas.

Boeing’s model was sleek and impressive, with a 103-foot wingspan and four engines jutting out from the wings, rather than the usual two.

Fig 1. Boeing Model 299

The plane roared down the tarmac, lifted off smoothly, and climbed sharply to three hundred feet. Then it stalled, turned on one wing, and crashed in a fiery explosion. Two of the five crew members died, including the test pilot. An investigation revealed that nothing mechanical had gone wrong. The crash had been due to “pilot error,” the report said.

Fig 2. The Boeing Model 299 on fire after its crash at Wright Field on Oct. 30, 1935

Substantially more complex than the previous aircraft, the new plane required the pilot to attend to the four engines, each with its own oil-fuel mix, the retractable landing gear, the wing flaps, electric trim tabs that needed adjustment to maintain stability at different airspeeds, and constant-speed propellers whose pitch had to be regulated with hydraulic controls, among other features.

While doing all this, the pilot had forgotten to release a new locking mechanism on the elevator and rudder controls. The Boeing model was deemed, as a newspaper put it, “too much airplane for one man to fly.” The Army Air Corps declared Douglas’s smaller design the winner.

So, a group of test pilots got together and considered what to do. They made a simple list, brief and to the point, short enough to fit on an index card, with step-by-step checks for takeoff, flight, landing, and taxiing: check that the brakes are released, that the instruments are set, that the door and windows are closed, that the elevator controls are unlocked. Dumb stuff. With the checklist in hand, the pilots went on to fly the Model 299 a total of 1.8 million miles without a single accident. The army ultimately ordered almost thirteen thousand of the aircraft.

The historical relationship between checklists, Process Safety incidents and their prevention

Analyzing case studies from the Fertilizer Industry Operational Risks Database (FIORDA), we notice that the energy and petrochemical industries still struggle to learn lessons from incidents. This view is prompted in part by the recurrence of similar events, and also by anecdotal evidence of the difficulty of achieving long-term changes in behavior and working processes when sharing lessons from accidents and incidents.

Previous research has indicated challenges at several stages in the Learning from Incidents (LFI) process including:

  • Reluctance to report incidents due to fear of disciplinary action or the perception that reporting does not lead to any change
  • Lack of human factors expertise in the analysis of incidents
  • Lack of time and resources dedicated to helping people understand and make sense of disseminated lessons learnt across various disciplines and departments (e.g. operation, maintenance, QA/QC, project teams, etc.)
  • Overload of recommendations, and failure of all involved parties to agree on actions.
  • Failure to check that implemented changes have addressed the underlying causes and reduced risk.

It is a common experience to hear the phrase, “We must learn lessons from this,” following a major accident or a more everyday event such as losing in a sporting competition. Indeed, this has become such a common phrase that one may feel that learning lessons is an automatic or natural process. In fact, the evidence from accidents and incidents indicates that it can be challenging for major hazard industries to learn effectively from such events.

If the only changes an organization makes are in response to Learning From Major Accidents (LFMA), rather than to the broad range of potential events represented in the accident pyramid, the typical result is large, disruptive changes following major accidents: risk is reduced by large outlays on new safety-related equipment, with high associated capital expenditure (CAPEX) costs and reduced plant availability during the implementation of the changes.

Over the longer term, however, the memory of these low-frequency events will be lost, and risk is likely to increase effectively unnoticed as the warning signs (weak signals) offered by incidents are not effectively processed. Thus, solely adopting an LFMA process will ultimately lead to an increased frequency of major events, higher average risks, larger damage costs, and more business interruptions.

An efficient LFI process will make use of multiple opportunities for learning, leading to lower risk and a more stable business environment, as the organization makes small, optimizing adjustments in response to the lessons learned. Here there is an opportunity to update the operators’ checklists to look for recurrence of similar events and report them through the appropriate channels for investigation and corrective action.
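The recurrence check described above can be sketched as a simple tally. This is a minimal illustration, assuming free-text sign labels and an escalation threshold of three observations; neither the sign names nor the threshold come from FIORDA.

```python
from collections import Counter

# Illustrative sketch: tally early warning signs recorded during operator
# rounds and escalate any sign that recurs beyond a chosen threshold.
# The sign names and the threshold are assumptions for illustration only.
ESCALATION_THRESHOLD = 3

def signs_to_escalate(observations, threshold=ESCALATION_THRESHOLD):
    """Return the warning signs whose recurrence count meets the threshold."""
    counts = Counter(observations)
    return sorted(sign for sign, n in counts.items() if n >= threshold)

log = [
    "frequent leaks", "high vibration", "frequent leaks",
    "abnormal color", "frequent leaks", "high vibration",
]
print(signs_to_escalate(log))  # -> ['frequent leaks']
```

The point of the sketch is that recurrence detection requires no sophisticated tooling: a running count against a review threshold is enough to turn operator observations into a reportable weak signal.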

In his book ‘Thinking, Fast and Slow’, psychologist Daniel Kahneman observes that people’s estimate of the likelihood of a major incident is inflated immediately after such an incident occurs, then progressively falls below the actual value over time. An analogy is a hazard receding in your car’s rear-view mirror as you drive away from it.

Figure 3. LFI vs LFMA

In the Fertilizer Industry Operational Risks Database, or FIORDA, we introduced a special classification called “Early Warning Signs”, where we document the preliminary signs that may lead to a Major Accident or trigger near misses; e.g. frequent leaks, a gradual increase in temperature or vibration, high pressure drop, abnormal color, etc. The purpose is to understand how Major Accidents and near misses can be prevented by understanding the process behavior in various conditions (start-up, shut-down, various types of plant upsets, abnormal downstream conditions, foaming, etc.)

Based on our industry records, a significant number of catastrophic incidents occur in the primary reforming area during start-up. Explosions of the primary reformer box, tube damage and catalyst poisoning occur repeatedly, as documented in multiple cases presented at this conference.

Fig 4. Hot bands in reformer tubes

Fig 5. Burner mis-alignment

Fig 6. Hot bands in reformer tubes

Fig 7. Reformer tubes

Fig 8. Reformer after the blast

Fig 9. Various Reformer tubes failures

While the control system may differ from plant to plant, one common observation keeps coming up: the operator failed to use the checklist during start-up, or the start-up procedure did not cover how to deal with a particular scenario. The operator did not check the burner flames, tube color or fuel valve positions, all of which contributed in the end to a very costly event. As noted by Johnson Matthey, at least once a year a primary reformer suffers a complete or partial burn-down, caused by:

  • Deviation from standard operating procedures in terms of operating pressure, fuel rates, etc.
  • Increase in fuel gas calorific value.
  • Low feed rates at too high firing rates.

Fourteen years ago, one of the conclusions was “It is unfortunately the case that with increasing frequency history is repeating itself and the hard-won learning of the last 50 years has been lost and now is being relearned.” One of the recommendations was “checklists should be provided to cover unfamiliar tasks.”

Companies aware of this trend have already developed Practices for Improvements in Reformer Performance and Reliability. A key aspect is reformer performance monitoring training, which includes background information on reformer reliability management, recognition and troubleshooting of reformer performance issues, reformer tube damage mechanisms and inspection, fitness-for-service and remaining life of reformer tubes, and how to recognize, troubleshoot and prevent common reformer tube failures. Practical training on tube and reformer visual inspection during operator rounds is discussed in the classroom and then practiced in the field by participants using specifically prepared checklists.

Demonstrating a correlation and postulating causation between appropriate use of checklists and a reduction in PS outages

Safety efforts pay. Norsk Hydro noticed this 30 years ago when it analyzed all the relevant incidents recorded in its monthly Safety Report for the period 1986-1993.

The analysis showed that about 50% of the incidents could have been avoided through closer inspection of the plants to identify maintenance needs and more stringent checks on maintenance work (before, during and after activities). Measures taken during operations, including the use of a suitable hazard identification method (such as HAZOP, checklists, inspections, and so on), could have prevented 22% of the incidents, with Safe Job Analysis accounting for a further 12%.

The benefits can be seen clearly from the survey of expense savings Norsk Hydro has achieved since 1985 (see Table 1).

1984-1988 | 1990-1994
Fatal Accident Rate (per 100 million hours worked): FAR = 3.9 | Fatal Accident Rate (per 100 million hours worked): FAR = 0.35
Average of 2 fatal accidents per year for Hydro employees | Average of 1 fatal accident every 5 years for Hydro employees
1,500-1,600 lost-time injuries | 365 lost-time injuries
Average 15-17 days lost per injury | Direct cost reduction alone amounts to USD 5.8 million a year
USD 20 million in direct material costs as a result of fires and explosions (lost production not included) | USD 5.8 million in direct material costs as a result of fires and explosions

Table 1. Norsk Hydro survey of expense results

Describing best practice in the use of Process Safety checklists

Nowadays, checklists are part of projects in the fertilizer industry from early stages of design until commissioning and operation.

Design

In the design phase, best industry practice makes use of lessons learned from other projects and includes questions in the documentation review that address aspects such as:

Have the reports of similar plants, citing accidents and the improvements made to counteract them, been reviewed and incorporated into the design?

Has the basic operating philosophy of the plant been reviewed and implemented, including the following:

  1. Normal/alternative design cases?
  2. Turndown?
  3. Start-up?
  4. Emergency cases?
  5. Full or partial plant shutdown?

Has any material substitution made during the design stage been discussed and approved with the metallurgist and corrosion specialist?

For revamp cases

The design review checklist shall include aspects like:

If the design is a new unit installed in an existing complex or a revamp of an existing unit, have the following been addressed:

  1. Is the overpressure protection philosophy for the new equipment compatible with the existing?
  2. Is the design philosophy for new piping and equipment compatible with the existing, or are additional overpressure protection devices required?
  3. Are the existing Utility/Auxiliary Systems compatible with the new equipment, and is the capacity of these existing systems sufficient to meet the additional demand from the new unit?
  4. Are materials of construction for the new/revamped unit compatible with the existing complex?

Based on previous incidents, other questions to be included in the project checklist during the design phase are:

  1. Have the criteria for avoiding fatigue failures from acoustic and flow-induced vibration in piping subject to high velocity and/or large pressure drops been checked? If flow or acoustic vibration is considered a possible problem, have the proper support guidelines been specified?
  2. If air or oxygen can enter the process, has the possibility of deflagration been reviewed and the system designed accordingly?
  3. Where a runaway reaction is possible, has this risk been adequately addressed in establishing the design (e.g., materials of construction, relieving loads, trips and interlocks, etc.)?
  4. Has overfilling of equipment been considered?
  5. Have all potential problems (e.g., corrosion, erosion, reactivity, improper mixing, vibration, etc.) caused by injection points been addressed? How?
  6. Are all potential problems (e.g., corrosion, erosion, reactivity, improper mixing, vibration, etc.) included and discussed in the HAZOP studies?

Pre-commissioning

Pre-commissioning procedures shall include checklists with questions related to equipment testing:

  1. When equipment is planned to be included in the piping system during testing, is it ensured that piping test pressure is compatible with equipment design pressure?
  2. For pneumatic testing, has maximum stored energy been evaluated?
  3. Has the procedure for pressure testing existing lines been agreed with EPC Contractor (this should be part of tie-in coordination meeting)?
  4. Has the procedure for pressure testing been determined for systems containing high pressure, high temperature piping welded to vessels (e.g. high-pressure steam systems)?
  5. Are there any national or local regulations prohibiting/limiting pneumatic testing?
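As a minimal sketch of how such a checklist can be enforced in practice, the questions above can be held as structured data with an explicit gate: the activity is blocked until every item has been answered and closed. The data structure and function names are illustrative assumptions, not part of any standard commissioning tool.

```python
# Minimal sketch of a pre-commissioning checklist gate: every question must
# be explicitly closed before the next phase can proceed. Item texts are
# shortened from the list above; the representation is an assumption.
checklist = {
    "Piping test pressure compatible with equipment design pressure?": None,
    "Maximum stored energy evaluated for pneumatic testing?": None,
    "Pressure-testing procedure agreed with EPC Contractor?": None,
}

def ready_to_proceed(items):
    """Proceeding is allowed only when no item is left open (None)."""
    open_items = [q for q, status in items.items() if status is None]
    return len(open_items) == 0, open_items

# One item closed, two still open: the gate stays shut.
checklist["Piping test pressure compatible with equipment design pressure?"] = "closed"
ok, pending = ready_to_proceed(checklist)
print(ok)            # False
print(len(pending))  # 2
```

The design choice worth noting is that the gate distinguishes “not yet answered” from any recorded answer, so an item cannot be silently skipped.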

Commissioning

To avoid any safety risk, a Pre-Start-up Safety Review (PSSR) shall be performed using a formal procedure. It is important that, at this stage of the project, the HAZOP actions have been closed and all P&ID comments are included in the final revision, updated to as-built status.

The checklist shall address the Mechanical Completion status, Instrumentation and Control loop checks, and ESD function checks.

Previous experience shows that trip logic checks and PSSRs do detect incorrect conditions, which demonstrates that they are needed. These tools should not rest only with the construction company but also with the technology provider, who should apply them because the technology provider:

  • knows where in (pre-)commissioning risks could arise
  • is the originator of the reference documents such as trip and interlock descriptions

Operation

Over time, plants undergo modification as a result of continuous improvement programs or revamp projects. New equipment is installed along with piping modifications and new control loops. As the projects are handed over to operations, new SOPs (Standard Operating Procedures) should be developed together with new checklists for operators. However, due to lack of time and resources, these documents often remain un-updated. In extreme cases, working instructions are passed on verbally from a senior operator to a junior, and so on. We have seen cases where an SOP and checklist were developed, proved effective and were presented officially during a training program, but were abandoned by operators after one month because “the Senior Operator knows better”.

Updating the P&IDs and SOPs, together with the checklists, very often becomes an action or recommendation in HAZOP studies.

Reviewing the evolution of checklists in related industries to determine which elements may be transferable

Even the most expert among us can gain from searching out the patterns of mistakes and failures and putting a few checks in place.

In medicine, a study performed to identify the impact of using the checklist showed that the rate of major complications for surgical patients in eight hospitals fell by 36 percent after introduction of the checklist. Deaths fell 47 percent. Infections fell by almost half. The number of patients having to return to the operating room after their original operations because of bleeding or other technical problems fell by one-fourth. Overall, in this group of nearly 4,000 patients, 435 would have been expected to develop serious complications based on earlier observation data. But instead just 277 did. Using the checklist had spared more than 150 people from harm and 27 of them from death.

Tom Wolfe’s book, “The Right Stuff,” tells the story of our first astronauts and charts the demise of the maverick test-pilot-culture of the 1950s. It was a culture defined by how unbelievably dangerous the job was. Test pilots strapped themselves into machines of barely controlled power and complexity, and a quarter of them were killed on the job. The pilots had to have focus, daring, wits, and an ability to improvise—the right stuff. But as knowledge of how to control the risks of flying accumulated—as checklists and flight simulators became more prevalent and sophisticated—the danger diminished, values of safety and conscientiousness prevailed, and the rock star status of the test pilots was gone.

Nowadays, before pilots start the plane’s engines at the gate, they adhere to a strict discipline, the kind most other professions avoid. They run through their checklists. They make sure they have introduced themselves to each other and the cabin crew. They do a short briefing, discussing the plan for the flight, potential concerns, and how they would handle trouble if they ran into it. And by adhering to this discipline, by taking just those few short minutes, they not only make sure the plane is fit to travel but also transform themselves from individuals into a team, one systematically prepared to handle whatever comes their way.

There is one element that we can learn from aviators and that is discipline.

We don’t know what we don’t know

… there are known knowns; … we also know there are known unknowns; … but there are also unknown unknowns, the ones we don’t know we don’t know. – Donald Rumsfeld

The ‘‘known knowns’’ are those issues that pose a hazard, have been identified, and have probably been addressed effectively including via a checklist used during design or operation.

The ‘‘known unknowns’’ are the focus of most hazards analyses. For example, a process may involve the transfer of chemicals from a high-pressure vessel to another vessel that operates at lower pressure. The designers of the system may not have considered the possibility that the pressure gradient could reverse, i.e., that the pressure in the first vessel could drop to a value less than that in the second vessel, thus creating the possibility of reverse flow. Such a scenario is a valid topic for the hazards analysis team to discuss. This scenario is a ‘‘known unknown’’ – it may not have been considered, but it is part and parcel of a normal hazard analysis.

It is the “unknown unknowns” that the operations and engineering team, and particularly the team leader, needs to give special consideration to. These situations require creative and imaginative thinking. In this context, another Rumsfeld quotation should be considered: “absence of evidence is not evidence of absence”. Getting team members to “think the unthinkable” is one of a team leader’s biggest challenges. First, we all know the “I’ve never seen it happen, therefore it can’t happen” syndrome. Second, these low-probability scenarios usually involve the simultaneous occurrence of contingent events (which is why they occur so rarely), and team members typically have trouble accepting and understanding unlikely combinations of events. To help overcome this block, the leader may choose to describe a number of real accidents that occurred elsewhere to show how “weird” they were, and yet they happened.

Conclusion – Human Factor

Are checklists the magic bullet for all our reliability and operational issues? The simple answer is no.

As much as we would like to take credit for our operating procedures as a safeguard, claiming that “operators are trained to deal with this situation”, reality proves otherwise. Under the stressful conditions of plant trips, sometimes on night shifts, overloaded with alarms and radio calls, even well-trained operators can make mistakes despite having the best-documented checklist in hand.

Research indicates that for a complicated non-routine task, such as restarting a plant to its original process state, the error rate is very high, at 0.1. For routine tasks where care is required, a simple task such as resetting a valve after related work has a physical operation error rate of 0.01 (equivalent to 1 error in 100 operations). However, many other factors make errors much more likely, acting as multipliers on the original error probability. For example:

  • If the operator is unfamiliar with a situation that is potentially important but either novel or occurring only infrequently, the nominal error probability multiplies significantly.
  • If there is a shortage of time for error detection and correction, the probability of making mistakes also grows. For a general high-stress situation the error rate is 0.25, meaning that 1 in 4 maneuvers may go wrong.
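The multiplier effect can be made concrete with a short calculation. The baseline probabilities follow the text (0.01 for a careful routine task, 0.1 for a complicated non-routine task); the multiplier values below are illustrative assumptions, not figures from a published human-reliability table.

```python
# Worked example of how scenario factors inflate a nominal human error
# probability (HEP). Baselines follow the text (0.01 routine, 0.1 complicated
# non-routine); the multipliers (5 for unfamiliarity, 3 for time shortage)
# are illustrative assumptions only.
def adjusted_hep(nominal, multipliers, cap=1.0):
    """Apply multiplier factors to a nominal error probability, capped at 1."""
    p = nominal
    for m in multipliers:
        p *= m
    return min(p, cap)

routine = 0.01                            # 1 error in 100 operations
stressed = adjusted_hep(routine, [5, 3])  # unfamiliarity x time shortage
print(f"{stressed:.2f}")  # 0.15, roughly 1 error in 7 operations
```

Note the cap at 1.0: multipliers stack quickly, and an unbounded product would otherwise exceed a valid probability for a complicated task under several simultaneous stressors.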

But, as Norsk Hydro demonstrated, the intelligent application of checklists can be a life saver. By applying the latest developments in automated systems and by training operators, including simulation of non-routine tasks informed by previous industry incidents, the nominal error probability can be significantly reduced.

The safest future involves trained and confident operators applying dynamic checklists for significant non-routine operations.

References:
  1. The Checklist Manifesto – Atul Gawande
  2. PSM Lessons from a Major Reformer Furnace Failure – RDC Prior, SH Excellence CC, Johannesburg, South Africa, IChemE HAZARD Symposium Series No 159
  3. Implementation of Best Practices for Improvements in Reformer Performance and Reliability – Giuseppe Franceschini Process Owner Inspection, Yara International, James R. Widri, Manager, Advanced Engineering, Quest Integrity, LLC, AICHE Ammonia Safety Symposium 2015
  4. Reliability, Maintainability and Risk – Dr David J. Smith
  5. Process Risk and Reliability Management – Ian Sutton, 2010
  6. Safe Start-up of Ammonia / Urea Plants under Challenging Circumstances – Klaus Noelker ThyssenKrupp Industrial Solutions AG, AICHE Ammonia Safety Symposium 2014
  7. 99 Questions to Ask during a Design Review Session, akh.deocamdata.ro
  8. Common Problems on Primary Reformers – Bill Cotton and Peter Broadhurst, Johnson Matthey Catalysts, AICHE Ammonia Safety Symposium 2004
  9. Guidance on Learning from Incidents, Accidents and Events Edward Smith, Principal Consultant (DNV GL), Richard Roels, Senior Consultant (DNV GL), DNV GL, IChemE HAZARD Symposium Series No 160
  10. Commissioning Handbook; Plant Design and Operations and Operation Procedures book by Ian Sutton
  11. Safety Management System: Safety Performance Indicators and Results, Willy Bjerke Norsk Hydro, N-0240 Oslo, Norway, AICHE Ammonia Safety Symposium 1997
  12. Catastrophic Failure of Reformer Tubes at Courtright Ammonia Plant, Bhaskar Rani Terra Industries Courtright, Ontario, Canada, AICHE Ammonia Safety Symposium 2006
  13. Primary Reformer – Firebox Explosion New BMS and HAZOP Actions overview – Aurelien Flamme YARA, Tertre, Sammy Van Den Broeck YARA, Sluiskil, Wim Versteele YARA, YPO