Forgetting this step will create extra work for your primary recovery team as they take time to explain what is going on. If you are working with a service provider, this position might instead be filled by an account or test manager.
Maintain an up-to-date access control list (ACL) specifying who, in both your company and your IT service provider (if applicable), has access to your data center and the resources therein.
Also specify which individuals can introduce guests to the data center. This will be useful for determining, in an emergency scenario, who may be designated as a point person for facilitating access to critical infrastructure.
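To make the ACL actionable during an emergency, it helps to keep it in a structured, version-controlled form. Below is a minimal sketch in Python of what such a record might look like; every name, role, and field is a hypothetical placeholder, not a prescribed schema.

```python
# Illustrative data-center ACL. All names, roles, and fields below are
# hypothetical examples; adapt them to your organization.
DATA_CENTER_ACL = [
    {"name": "J. Smith", "role": "DR Coordinator", "company": "YourCo",
     "access": "24x7", "may_sponsor_guests": True},
    {"name": "A. Jones", "role": "Network Engineer", "company": "YourCo",
     "access": "24x7", "may_sponsor_guests": False},
    {"name": "P. Patel", "role": "Account Manager", "company": "ProviderCo",
     "access": "business hours", "may_sponsor_guests": True},
]

def emergency_points_of_contact(acl):
    """Return entries that can sponsor guests: candidates for the
    point person who facilitates access during an emergency."""
    return [entry for entry in acl if entry["may_sponsor_guests"]]
```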
During a recovery event your primary operations team is going to be busy recovering systems, so be sure you know whom to contact and how to gain access to your data center. Examples are provided in the table below. Remove, replace, and add individuals to this list as appropriate for your organization and infrastructure.
During any disaster event there should be a defined call tree specifying the exact roles and procedures for each member of your IT organization to communicate with key stakeholders both inside and outside of your company. When defining the call structure, limit each branch of your tree to a manageable ratio of callers to call recipients.
As a first step, for example, your Disaster Recovery Coordinator might call both the company CEO and head of operations, both of whom would then inform the appropriate contacts in their teams along with key customers, service providers, and other stakeholders responsible for correcting the service outage and restoring data and operations.
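The call tree itself can be kept as simple structured data so the fan-out is easy to audit. The sketch below mirrors the example above; the 1:3 fan-out limit and the role names are invented illustrations, not standards.

```python
# Illustrative call tree: each caller notifies a small, fixed set of
# recipients. The fan-out limit of 3 is an example ratio, not a standard.
MAX_FAN_OUT = 3

CALL_TREE = {
    "DR Coordinator": ["CEO", "Head of Operations"],
    "CEO": ["Board Liaison", "Key Customer Contact"],
    "Head of Operations": ["NOC Lead", "Service Provider Contact"],
}

def branches_over_limit(tree, limit=MAX_FAN_OUT):
    """Flag any caller whose branch exceeds the caller-to-recipient ratio."""
    return {caller: recipients for caller, recipients in tree.items()
            if len(recipients) > limit}

assert branches_over_limit(CALL_TREE) == {}  # every branch is within the limit
```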
As you create your runbook, you must consider guidelines for declaring a disaster scenario; the guidelines we recommend are specified in the chart below. Technology can be incorporated into the declaration steps of a DR plan. Be sure not to declare a disaster on the first instance of an event unless it is completely understood that subsequent instances of the event will cause increased damage to your customers or your business systems.
The table below details some standard practices for mitigating premature declarations. SLAs should be built in a manner that allows for some troubleshooting and system restoration before a disaster must be declared. Also use this section to outline standard monitoring procedures along with associated thresholds. List all system monitors, what they do, their thresholds, the alerts raised when those thresholds are met or exceeded, the individual(s) who receive the alerts, and the remediation steps for each monitor.
List event monitoring standards by defining thresholds for event types, durations, corrective actions to be taken once a threshold is met, and event criticality levels. Use the following chart, or a derivative of it, to specify your event monitoring standards.
These event types (memory, storage, network, ping check, and IP check) are categories of events for which you should list specific examples in this chart. In this section, list your step-by-step procedures for responding to service issue alerts, along with detailed procedures for issue management and escalation, when necessary, in the case of an unmet service objective.
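One way to capture these monitoring standards is a configuration structure that pairs each event type with its threshold, duration, criticality, alert recipients, and corrective action. This is only a sketch; every threshold, duration, and recipient below is a placeholder, not a recommended value.

```python
# Illustrative event-monitoring standards. Thresholds, durations, and
# recipients are placeholders to adapt to your environment.
MONITORING_STANDARDS = {
    "memory": {"threshold": "90% used", "duration": "5 min",
               "criticality": 2, "alert_to": ["NOC"],
               "corrective_action": "Identify and restart the offending service"},
    "storage": {"threshold": "85% full", "duration": "15 min",
                "criticality": 3, "alert_to": ["Systems Engineer"],
                "corrective_action": "Expand the volume or purge stale data"},
    "ping check": {"threshold": "3 failed probes", "duration": "1 min",
                   "criticality": 1, "alert_to": ["NOC", "Network Engineer"],
                   "corrective_action": "Check the host and upstream links"},
}
```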
Escalation procedures will vary by level of operation and severity of the associated activities. At Evolve IP, for example, we categorize standard operating procedure interruptions in five levels (5 being the lowest severity, 1 the highest). Of course, these can and will differ among organizations. The following serves only as an example.
Depending on the severity of the service interruption, your escalation procedures will vary by parties involved, response chain, response time and target resolution.
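As a rough illustration, a five-level scheme like the one described above can be captured as a lookup table keyed by severity. The parties and response times below are invented examples, not recommendations.

```python
# Illustrative severity-to-escalation mapping (5 = lowest severity,
# 1 = highest), following the five-level scheme described above.
ESCALATION = {
    5: {"parties": ["Support Engineer"], "response": "next business day"},
    4: {"parties": ["Support Engineer", "NOC"], "response": "4 hours"},
    3: {"parties": ["NOC", "Systems Engineer"], "response": "1 hour"},
    2: {"parties": ["Systems Engineer", "DR Coordinator"], "response": "30 minutes"},
    1: {"parties": ["DR Coordinator", "CEO", "Head of Operations"],
        "response": "immediate"},
}

def escalation_for(severity):
    """Look up who responds, and how quickly, for a given severity level."""
    return ESCALATION[severity]
```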
Recovery events require prioritizing data and business process restoration. At times, other non-critical standard operating procedures (SOPs) must be suspended.
During a recovery event, recovery operations should take precedence over inbound queries or tickets. Monitors and alerts should also be reviewed and, where appropriate, suspended until recovery is complete. This is a best practice that avoids flooding your network operations center (NOC) and support teams with bogus alarms.
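If your monitoring is driven by configuration or an API, the suspension itself can be scripted. The following is a minimal sketch assuming an in-house Monitor record and the five-level criticality scale used earlier; it is not tied to any particular monitoring product.

```python
# Minimal sketch: suspend non-critical monitors for the duration of a
# recovery event so the NOC is not flooded with expected alarms.
# The Monitor type and criticality scale are illustrative.
from dataclasses import dataclass

@dataclass
class Monitor:
    name: str
    criticality: int   # 1 = highest, 5 = lowest (example scale)
    enabled: bool = True

def suspend_noncritical(monitors, keep_at_or_above=2):
    """Disable monitors less critical than the cutoff
    (a higher number means lower criticality on this scale)."""
    for m in monitors:
        if m.criticality > keep_at_or_above:
            m.enabled = False
    return monitors
```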
Change management policies should also be adjusted to expedite recovery procedures. For example, adding a new server or firewall rule in a standard environment might take a day once all necessary reviews and approvals are complete; during recovery operations, the same change should be expedited.
Tickets opened during recovery operations should be reviewed to confirm that the requested tasks are necessary. Non-critical tickets should be deferred and addressed once recovery procedures are complete.
Remember, the number one rule in recovery is: recover! Get things back up and running, whether in a workaround, failover, or full restore state. With that in mind, use this section to identify which standard operating procedures will be suspended in a true emergency scenario, one that would fall under your critical or fatal service interruption classifications.
List out specifications for change management, monitors and alerts, and problem and issue resolution during recovery procedures.
Certain non-critical standard operating procedures may be suspended, as in the following situation: a routine service ticket is submitted while recovery operations are underway. That ticket would be answered with a message that your organization is currently in a recovery operations cycle and the ticket will be addressed as soon as technicians have completed the restoration work.
Your runbook content, up to this point, has addressed organizational points of concern. At this stage, your runbook should fully document your company's procedures for issue management and escalation, the criteria for evaluating and declaring an emergency scenario, and the procedures for ensuring all key stakeholders and responsible parties are in communication and ready to take the necessary steps to begin disaster recovery.
From this point forward, the runbook shifts focus to system-level procedures: infrastructure and network configurations, restoration steps, and system-level responsibilities while in disaster recovery mode. Provide a detailed overview of your IT environment in this section, including the location(s) of all data center(s) and the nature of use of each facility. Include an address and directions to each location.
Data centers and colocation facilities typically maintain strict entry protocols, and certain members of your organization will hold the credentials required to enter each facility. This section should include instructions for recovery personnel laying out which infrastructure components to restore and in which order. It should take into account application dependencies, authentication, middleware, database, and third-party elements, and it should list restoration items by system or application type.
Ensure that this order of restoration is understood before engaging in restore work. An example is provided below. The rest of the table should be filled out in the exact order that restoration procedures are to be completed.
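Because the restoration order is driven by dependencies, it can be derived rather than maintained by hand. The sketch below uses Python's standard graphlib module to compute a valid restore order from an invented dependency map; substitute your own systems and dependencies.

```python
# Sketch: derive a restoration order from application dependencies
# using a topological sort. The dependency map is an invented example.
from graphlib import TopologicalSorter  # Python 3.9+

# Each system lists the systems that must be restored before it.
DEPENDENCIES = {
    "network": [],
    "storage": ["network"],
    "authentication": ["network"],
    "database": ["network", "storage"],
    "middleware": ["authentication", "database"],
    "web application": ["middleware"],
    "third-party feeds": ["web application"],
}

restore_order = list(TopologicalSorter(DEPENDENCIES).static_order())
print(restore_order)
# e.g. ['network', 'storage', 'authentication', 'database',
#       'middleware', 'web application', 'third-party feeds']
```

Keeping the dependency map, rather than the flattened order, as the source of truth means the restoration table stays correct as systems are added or retired.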
This section should include system- and application-specific topology diagrams and an inventory of the elements that comprise your overall system. Include networking, web and application middleware, database, and storage elements, along with the third-party systems that connect to and share data with each system. Lay out each of your systems separately and include a table for your network, server layout, and storage layout.
Use this section to list instructions specifying the servers, directories, and files from and to which backup procedures will be run. This should be the location of your last known good copy of production data. Restoring from a disaster should result in a mirror of your production environment, even if scaled.
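A structured backup manifest makes these instructions unambiguous for whoever performs the restore. In this sketch the server names, paths, and schedules are placeholders.

```python
# Illustrative backup manifest, listed by server: where backups are taken
# from and where the last known good copy lives. All values are placeholders.
BACKUP_MANIFEST = {
    "db-01": {
        "source": "/var/lib/postgresql/data",
        "backup_target": "/mnt/backups/db-01/",
        "schedule": "daily 02:00",
    },
    "web-01": {
        "source": "/srv/www",
        "backup_target": "/mnt/backups/web-01/",
        "schedule": "daily 03:00",
    },
}
```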
Monitors and alerts are a critical element of your production system; listed by server, be sure that these monitors are put in place and activated as part of your restore activities. The matrix below describes the participation of various roles in completing DR tasks or deliverables. Fill it in, specifying the roles for your company, your service provider (if applicable), and any other third parties that will be involved in your disaster recovery tests.
Positions that will fill these roles and responsibilities will often include your DR coordinator, network engineer, database engineer, systems engineer, application owner, data center service coordinator, and your service provider. Identify the responsibilities of each of these roles in a disaster event, then map them onto a matrix of all activities associated with recovery procedures, as in the example table provided below.
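Alongside the table, one way to keep such a matrix auditable is to store it as data. The sketch below uses a RACI-style coding (responsible, accountable, consulted, informed) with invented activities and role assignments.

```python
# Sketch of a RACI-style responsibility matrix for recovery activities.
# R = responsible, A = accountable, C = consulted, I = informed.
# Activities and assignments are illustrative.
RACI = {
    "Declare disaster":   {"DR Coordinator": "A", "CEO": "C", "NOC": "I"},
    "Restore network":    {"Network Engineer": "R", "DR Coordinator": "A"},
    "Restore database":   {"Database Engineer": "R", "Systems Engineer": "C"},
    "Verify application": {"Application Owner": "R", "DR Coordinator": "I"},
}

def tasks_for(role):
    """List every activity in which a role participates, with its RACI code."""
    return {task: roles[role] for task, roles in RACI.items() if role in roles}
```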
Use this section to outline the steps necessary to respond to outage alerts and, subsequently, restore data from backup records. Include your order of backup operations, data dependencies based on how your backups are organized, and troubleshooting steps. These processes will be followed whenever data recovery is necessary, including scenarios in which systems are still running but data must be restored, restores after a disaster event, and restores from a backup volume.
Make sure to update the template as you enhance your system architecture and identify new outage scenarios.
A runbook may also describe procedures for handling special requests and contingencies. An effective runbook allows other operators, with the prerequisite expertise, to effectively manage and troubleshoot a system. Through runbook automation, these processes can be carried out by software tools in a predetermined manner.
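At its simplest, runbook automation is an ordered list of steps executed by a tool instead of a person. The following sketch is illustrative only; the step functions are stand-ins for real checks and remediation actions.

```python
# Minimal sketch of runbook automation: steps run in a predetermined
# order, and each result is visible for later review. The step
# functions are placeholders for real remediation actions.
def check_service_health():
    print("checking service health...")

def restart_service():
    print("restarting service...")

def notify_on_call():
    print("paging on-call engineer...")

RUNBOOK_STEPS = [check_service_health, restart_service, notify_on_call]

def run_runbook(steps):
    for step in steps:
        step()  # in practice: capture output, handle failures, escalate

run_runbook(RUNBOOK_STEPS)
```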
In product development, the minimum viable product (MVP) is the product with the highest return on investment versus risk. An MVP is not a minimal product; it is a strategy and process directed toward making and selling a product to customers.
It is an iterative process of idea generation, prototyping, presentation, data collection, analysis, and learning. In IT process development, the minimum viable runbook (MVR) is the runbook with the highest return on valuable information versus time spent creating it.
Like the MVP, the MVR is a strategy and process, here directed toward making and implementing runbooks for your IT team: an iterative cycle of idea generation, prototyping, automation, presentation, information capture, analysis, and learning. Your teams are busy, and it is hard to justify carving out time for a dedicated team member to build runbooks, design processes, and gather information from previous successes when you have 24x7 operations that need their attention.
Incidents are inevitable, outages are unforeseeable, and frustration can quickly become an internal IT cultural norm. Thus, runbooks are important in providing your teams with contextual documents to support their efforts.
How do we help these teams succeed at the important, but not urgent, task of creating runbooks? How do we ensure that our teams are prepared for the next incident? How do we ensure consistency in our recovery processes? What actions are we taking to truly lower downtime? By keeping a record of which actions and activities have worked, you can leverage the steps that fixed the problem to build your Minimum Viable Runbook.
THE WHO (no, not the band): if you need to pull in additional team members, your runbook should include the specific teams or team members to contact when escalating the incident.
This is where the rubber meets the road. Your runbooks need to give call-to-action instructions, clickable links, and viewable graphs (live, plus static images of graphs for reference). This gives the incident team member instant context not only on how to fix the problem, but on what to look for, and it reduces the stress that accompanies solving a complex problem at 4am.