Postmortem Report

Oscar Angel
Feb 20, 2022

Issue Summary

From 15:00 to 15:30 (UTC-5), the server at IP address 189.899.099.88 answered requests with a 500 Internal Server Error. Web pages and applications that rely on this server returned errors. During this window the issue affected 100% of traffic, and clients could not get a successful response to any request sent to the server. The root cause was an invalid configuration change: some of the .php extensions in wp-settings.php were misspelled.

Timeline

15:00: Configuration push begins

15:00: Outage begins

15:05: Pagers alerted teams

15:25: Successful configuration rollback

15:30: Server restarts again

15:30: 100% traffic back online

Root Cause

At 15:00 (UTC-5), an update to the files was inadvertently released to our production environment without being tested first. In the updated wp-settings.php, one of the referenced files had a typographical error in its extension, ending in .phpp instead of .php. This prevented the server from running the application properly, so it returned a 500 Internal Server Error.
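
To make the failure mode concrete, here is a minimal Python sketch of a check that could flag this kind of typo before a release. It assumes the paths appear as quoted strings in the file. The function name find_suspicious_extensions, the regular expression, and the sample lines are illustrative assumptions, not something the team actually ran.

```python
import re

# Matches a quoted string that ends in ".php" followed by extra letters,
# for example 'plugin.phpp'. Quoted paths ending exactly in ".php" do not match.
SUSPICIOUS_PATH = re.compile(r"""['"]([^'"\s]+\.php[A-Za-z]+)['"]""")


def find_suspicious_extensions(source: str) -> list[tuple[int, str]]:
    """Return (line_number, path) pairs for paths with a malformed .php extension."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for match in SUSPICIOUS_PATH.finditer(line):
            findings.append((line_number, match.group(1)))
    return findings


if __name__ == "__main__":
    # Hypothetical excerpt; the real file contents are not shown in this report.
    sample = (
        "require ABSPATH . WPINC . '/load.php';\n"
        "require ABSPATH . WPINC . '/plugin.phpp';\n"
    )
    for line_number, path in find_suspicious_extensions(sample):
        print(f"line {line_number}: suspicious path {path}")
```

A check like this only catches one class of typo. It complements running the change in a testing environment and does not replace it.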

Resolution and Recovery

At 15:05 (UTC-5), the monitoring software alerted our engineers. After following the web stack debugging procedure, they found that some path names in the wp-settings.php file had the extension .phpp instead of .php. Using Puppet, they then wrote and ran a program that replaced every *.phpp path with *.php. By 15:25 (UTC-5) the program had been tested, and at 15:30 (UTC-5) the server was restarted. After the restart, the server and its programs ran correctly, returning 200 OK when services were requested, and 100% of traffic was back online.
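
The Puppet code itself is not included in this report. As a rough illustration of the same replacement, here is a sketch in Python rather than Puppet. The names fix_phpp_extensions and fix_file are hypothetical, and the sketch assumes the only problem is a literal ".phpp" at the end of a path.

```python
import re
from pathlib import Path

# ".phpp" followed by a word boundary, so a path ending in ".phpp" is rewritten
# while a longer token such as ".phppx" is left alone.
PHPP_EXTENSION = re.compile(r"\.phpp\b")


def fix_phpp_extensions(source: str) -> tuple[str, int]:
    """Return the corrected text and the number of replacements made."""
    return PHPP_EXTENSION.subn(".php", source)


def fix_file(path: Path) -> int:
    """Rewrite the file in place only if something changed; return the change count."""
    original = path.read_text(encoding="utf-8")
    fixed, count = fix_phpp_extensions(original)
    if count:
        path.write_text(fixed, encoding="utf-8")
    return count
```

Returning the replacement count makes it easy to log what changed and to confirm that a second run changes nothing.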

Corrective and Preventive Measures

After the incident, the engineering team agreed on the following corrective and preventive measures:

  • Last-minute human errors, such as typographical errors, can be more common than expected. Making sure the program runs properly in a testing environment before sending it to production is therefore crucial.
  • The testing environment must also match production, including the same program versions. This prevents errors in production caused by version differences.
  • Finally, developing better alerting mechanisms can speed up our response when bugs reach production.
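
One way to combine the testing and alerting measures above is a small smoke check that runs against the testing environment before a release and against production afterwards. The following Python sketch uses only the standard library. The function names, the URLs, and the stubbed status codes are hypothetical, and connection failures (urllib.error.URLError) are deliberately left unhandled so that they surface as errors.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx and 5xx responses; keep the status code.
        return error.code


def failing_urls(
    urls: Iterable[str],
    fetch_status: Callable[[str], int] = http_status,
) -> dict[str, int]:
    """Return url -> status for every url that did not answer 200."""
    failures = {}
    for url in urls:
        status = fetch_status(url)
        if status != 200:
            failures[url] = status
    return failures


if __name__ == "__main__":
    # Stubbed responses so the example runs without network access.
    stub = {"https://staging.example.com/": 200, "https://staging.example.com/blog": 500}
    for url, status in failing_urls(stub, fetch_status=stub.get).items():
        print(f"ALERT: {url} returned {status}")
```

Passing the status function as a parameter keeps the check testable without a live server. In a real pipeline the default http_status would be used, and a non-empty result would block the release or trigger an alert.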

Our company is committed to continuous improvement, and the engineering team is committed to implementing the corrective and preventive measures by the end of this week. New procedures will be put in place so that we do not repeat the same mistakes.
