How do SLAs factor into cloud risk management?
While you may not have much control over the infrastructure used by cloud service providers, you’re not completely at their mercy when it comes to cloud risk management.
The following is an excerpt from The Official (ISC)2 Guide to the CCSP CBK, Second Edition, by Adam Gordon, CISSP-ISSAP, ISSMP, SSCP. This section from Domain 6 details different facets of cloud service-level agreements (SLAs) and explains the important role of an SLA in cloud risk management.
The cloud represents a fundamental shift in the way technology is offered. The shift is toward the consumerization of IT services and convenience. In addition to the countless benefits outlined in this book and those you may identify yourself, the cloud creates an organizational change (Figure 1).
It is important for both the CSP and the cloud customer to be focused on cloud risk management. The manner in which typical risk management activities, behaviors, processes, and related procedures are performed may require significant revisions and redesign. After all, the way services are delivered changes delivery mechanisms, locations, and providers -- all of which result in governance and risk-management changes.
These changes need to be identified from the scoping and strategy phases through the ongoing and recurring tasks, both ad hoc and periodically scheduled. Addressing these risks requires that the CSP and cloud customer's policies and procedures be aligned as closely as possible because cloud risk management must be a shared activity to be implemented successfully.
Risk profile
The risk profile is determined by an organization’s willingness to take risks as well as the threats to which it is exposed. The risk profile should identify the level of risk to be accepted, the way risks are taken, and the way risk-based decision making is performed. Additionally, the risk profile should take into account potential costs and disruptions should one or more risks be exploited.
To this end, it is imperative that an organization fully engages in a risk-based assessment and review against cloud-computing services, service providers, and the overall effects on the organization should it utilize cloud-based services.
Risk appetite
Swift decision making can lead to significant advantages for the organization, but when assessing and measuring the relevant risks in cloud-service offerings, it’s best to have a systematic, measurable, and pragmatic approach to cloud risk management. Undertaking these steps effectively enables the business to balance the risks and offset any excessive risk components, all while satisfying listed requirements and objectives for security and growth.
Emerging or rapid-growth companies will be more likely to take significant risks when utilizing cloud-computing services so they can be first to market.
Difference between the data owner and controller and the data custodian and processor
Treating information as an asset requires a number of roles and distinctions to be clearly identified and defined. The following are key roles associated with data management:
The data subject is an individual who is the focus of personal data.
The data controller is a person who either alone or jointly with other persons determines the purposes for which and the manner in which any personal data is processed.
The data processor in relation to personal data is any person other than an employee of the data controller who processes the data on behalf of the data controller.
Data stewards are commonly responsible for data content, context, and associated business rules.
Data custodians are responsible for the safe custody, transport, data storage, and implementation of business rules.
Data owners hold the legal rights and complete control over a single piece or set of data elements. Data owners also possess the ability to define distribution and associated policies.
Service level agreement
Similar to a contract signed between a customer and a CSP, the service level agreement (SLA) forms the most crucial and fundamental component of how security and operations will be undertaken. The cloud SLA should also capture requirements related to compliance, best practice, and general operational activities to satisfy each of these.
Within an SLA, the following topics should be covered at a minimum:
- Availability (for example, 99.99 percent of services and data)
- Performance (for example, expected response times versus maximum response times)
- Security and privacy of the data (for example, encrypting all stored and transmitted data)
- Logging and reporting (for example, audit trails of all access and the ability to report on key requirements and indicators)
- DR expectations (for example, worst-case recovery commitment, recovery time objectives [RTOs], maximum period of tolerable disruption [MPTD])
- Location of the data (for example, ability to meet requirements or remain consistent with local legislation)
- Data format and structure (for example, data retrievable from provider in readable and intelligible format)
- Portability of the data (for example, ability to move data to a different provider or to multiple providers)
- Identification and problem resolution (for example, help desk/service desk, call center, or ticketing system)
- Change-management process (for example, updates or new services)
- Dispute-mediation process (for example, escalation process and consequences)
- Exit strategy with expectations on the provider to ensure a smooth transition
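As a rough illustration, the minimum topics above can be treated as a checklist and compared against the sections a draft SLA actually contains. The sketch below is only that: the short topic keys and the `missing_topics` helper are hypothetical names chosen for this example, not part of any standard or provider tooling.

```python
# Minimum SLA topics, taken from the list above. The short keys are
# hypothetical labels chosen for this sketch.
REQUIRED_TOPICS = {
    "availability",
    "performance",
    "security_privacy",
    "logging_reporting",
    "dr_expectations",
    "data_location",
    "data_format",
    "data_portability",
    "problem_resolution",
    "change_management",
    "dispute_mediation",
    "exit_strategy",
}


def missing_topics(sla_sections):
    """Return the required topics that a draft SLA does not cover, sorted."""
    return sorted(REQUIRED_TOPICS - set(sla_sections))


# Hypothetical draft SLA that covers only a few of the topics.
draft = ["availability", "performance", "security_privacy", "exit_strategy"]
gaps = missing_topics(draft)
print(f"{len(gaps)} topics missing:", ", ".join(gaps))
```

A gap list like this is only a starting point for negotiation; it says nothing about whether a covered topic is covered adequately.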
Cloud SLA components
Although cloud SLAs tend to vary significantly depending on the provider, more often than not they are structured in favor of the provider to ultimately expose them to the least amount of risk. Note the examples of how elements of the SLA can be weighed against the customer’s requirements around cloud risk management and other operational needs (Figure 2).
- Uptime guarantees
Service levels regarding performance and uptime are usually featured in outsourcing contracts but not in software contracts, despite the significant business-criticality of certain cloud applications.
Numerous contracts have no uptime or performance service-level guarantees or are provided only as changeable URL links.
SLAs, if they are defined in the contract at all, are rarely guaranteed to stay the same upon renewal or not to significantly diminish.
A material diminishment of the SLA upon a renewal term may necessitate a rapid switch to another provider at significant cost and business risk.
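To judge whether a change in an uptime guarantee is material, it can help to convert the percentage into the downtime it permits. The following is a minimal sketch; the function name and the 30-day period are assumptions made for illustration, and real contracts define their own measurement periods.

```python
def allowed_downtime_minutes(availability_pct, period_days=30):
    """Minutes of downtime permitted in the period at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Compare a few commonly quoted availability levels over an assumed 30-day period.
for pct in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% availability allows about {minutes:.1f} minutes of downtime")
```

Under these assumptions, moving from 99.99 percent to 99.9 percent at renewal multiplies the permitted downtime tenfold, which is the kind of diminishment worth catching before signing.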
- SLA penalties
For SLAs to be used to steer the behavior of a cloud services provider, they need to be accompanied by financial penalties.
Contract penalties provide an economic incentive for providers to meet stated SLAs. This is an important mechanism for cloud risk management and mitigation, but such penalties rarely, if ever, provide adequate compensation to a customer for related business losses.
Penalty clauses are not a form of risk transfer.
Penalties, if they are offered, usually take the form of credits rather than refunds. But who wants an extension of a service that does not meet requirements for quality? Some contracts offer to give back penalties if the provider consistently exceeds the SLA for the remainder of the contract period.
- SLA penalty exclusions
Limitation on when downtime calculations start: Some CSPs require that the application is down for a period of time (for example, 5 to 15 minutes) before any counting toward SLA penalty will start.
Scheduled downtime: Several CSPs claim that if they give you warning, an interruption in service does not count as unplanned downtime but rather as scheduled downtime and, therefore, is not counted when calculating penalties. In some cases, the warning can be as little as eight hours.
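The effect of these two exclusions can be sketched with a small calculation. The code below is illustrative only: the `Outage` type and `countable_downtime` function are hypothetical, and it assumes that scheduled outages are excluded entirely and that, for each unscheduled outage, counting begins only after the grace period has elapsed. Actual contracts may define either exclusion differently.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float   # total length of the interruption
    scheduled: bool  # True if the provider gave advance warning


def countable_downtime(outages, grace_minutes=15):
    """Downtime that counts toward SLA penalties under the assumed exclusions."""
    total = 0.0
    for outage in outages:
        if outage.scheduled:
            continue  # assumed: scheduled downtime is never counted
        # assumed: only the time beyond the grace period is counted
        total += max(0.0, outage.minutes - grace_minutes)
    return total


# Hypothetical month: a short blip, a long outage, and a "scheduled" window.
outages = [Outage(10, False), Outage(90, False), Outage(240, True)]
actual = sum(o.minutes for o in outages)
counted = countable_downtime(outages)
print(f"actual downtime: {actual} min, counted toward penalties: {counted} min")
```

The gap between actual and counted downtime is the part of the service interruption that the customer absorbs with no contractual remedy.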
Cloud service contracts
Some cloud contracts state that if payment is more than 30 days overdue (including any disputed payments), the provider can suspend the service. This gives the CSP considerable negotiation leverage in the event of any dispute over payment.
Most cloud contracts restrict liability, apart from infringement claims relating to intellectual property, to a maximum of the value of the fees over the past 12 months. Some contracts even limit it to as little as six months of fees.
If the CSP were to lose the customer’s data, for example, the financial exposure would likely be much greater than 12 months of fees.
Most cloud contracts make the customer ultimately responsible for security, data protection and compliance with local laws. If the CSP is complying with privacy regulations for personal data on your behalf, you need to be explicit about what the provider is doing and understand any gaps as part of your cloud risk management strategy.
Cloud contracts rarely contain provisions about DR or provide financially backed RTOs. Some IaaS providers do not even take responsibility for backing up customer data.
Cloud security recommendations for SLAs
Gartner recommends negotiating SLAs for security, especially for security breaches, and has seen some CSPs agree to this. Immediate notification of any security or privacy breach as soon as the provider is aware is highly recommended.
Because the cloud customer is ultimately responsible for the organization’s data and for alerting its customers, partners, or employees of any breach, it is particularly critical for companies to determine what mechanisms are in place to alert them if any security breaches do occur and to establish SLAs determining the time frame the CSP has to alert you of any breach.
The time frames you have to respond within will vary by jurisdiction but may be as little as 48 hours. Be aware that if law enforcement becomes involved in a provider security incident, it may supersede any contractual requirement to notify you or to keep you informed.
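A breach-notification SLA is easier to enforce when it is expressed as a measurable window. As a hedged sketch, the check below compares the delay between the provider becoming aware of a breach and notifying the customer against an agreed window. The function name, the timestamps, and the 24-hour window are all hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta


def notification_status(provider_aware_at, customer_notified_at, sla_window_hours):
    """Compare the provider's notification delay with the agreed SLA window."""
    delay = customer_notified_at - provider_aware_at
    window = timedelta(hours=sla_window_hours)
    return {
        "delay_hours": delay.total_seconds() / 3600,
        "within_sla": delay <= window,
    }


# Hypothetical incident: the provider learns of the breach at 09:00 and
# notifies the customer at 15:00 the next day.
aware = datetime(2024, 3, 1, 9, 0)
notified = datetime(2024, 3, 2, 15, 0)
status = notification_status(aware, notified, sla_window_hours=24)
print(status)
```

When negotiating the window, it is worth leaving enough margin between the provider's notification deadline and the time frame the customer itself must respond within.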
These examples highlight the dangers of insufficient focus and due diligence when engaging with a CSP around the SLA. Because they are only a general sample of potential pitfalls related to the SLA, the following documents can serve as useful reference points when ensuring that SLAs are in line with business requirements. They can also help balance risks that may previously have been unforeseen.
Key cloud SLA elements
The following key elements should be assessed when reviewing and agreeing to the SLA:
Assessment of risk environment: What types of risks does the organization face?
Risk profile: How many risks are there, and what are their potential effects?
Risk appetite: What level of risk is acceptable?
Responsibilities: Who will do what?
Regulatory requirements: Will these be met under the SLA?
Risk mitigation: Which mitigation techniques and controls can reduce risks?
Risk frameworks: What frameworks are to be used to assess the ongoing effectiveness? How will the provider practice cloud risk management?
Ensuring quality of service
A number of key indicators form the basis for determining the success or failure of a cloud offering. The following should form a key component of metrics and appropriate monitoring requirements:
Availability: This measures the uptime (availability) of the relevant services over a specified period as an overall percentage, for example, 99.99%.
Outage duration: This captures and measures the loss of service time for each instance of an outage, such as 1/1/201x -- 09:20 start -- 10:50 restored -- 1 hour 30 minutes loss of service.
Mean time between failures (MTBF): This captures the indicative or expected time between consecutive or recurring service failures -- for example, 1.25 hours per day of 365 days.
Capacity metric: This measures and reports on capacity capabilities and the ability to meet requirements.
Performance metrics: This utilizes and actively identifies areas, factors, and reasons for bottlenecks or degradation of performance. Typically performance is measured and expressed as requests or connections per minute.
Reliability Percentage metric: This lists the success rate for responses and is based on agreed criteria -- for example, a 99% success rate in transactions completed to the database.
Storage Device Capacity metric: This lists metrics and characteristics related to storage device capacity; it is typically provided in gigabytes.
Server Capacity metric: This lists the characteristics of server capacity, based on and influenced by central processing units (CPUs), CPU frequency in GHz, random access memory (RAM), virtual storage, and other storage volumes.
Instance Startup Time metric: This indicates or reports on the length of time required to initialize a new instance, calculated from the time of request by user or resource, and typically measured in seconds and minutes.
Response Time metric: This reports on the time required to perform the requested operation or tasks, typically measured based on the number of requests and response times in milliseconds.
Completion Time metric: This provides the time required to complete the initiated or requested task, typically measured by the total number of requests as averaged in seconds.
Mean-Time to Switchover metric: This provides the expected time to switch over from a service failure to a replicated failover instance. This is typically measured in minutes and captured from commencement to completion.
Mean-Time System Recovery metric: This highlights the expected time for a complete recovery to a resilient system in the event of or following a service failure or outage. This is typically measured in minutes, hours, and days.
Scalability Component metrics: This is typically used to analyze customer use, behavior, and patterns that can allow for the auto-scaling and auto-shrinking of servers.
Storage Scalability metric: This indicates the storage device capacity available if increased workloads and storage requirements are necessary.
Server Scalability metric: This indicates the available server capacity that can be utilized when changes in increased workloads are required.
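Several of the indicators above, namely availability, outage duration, and MTBF, can be derived from a simple outage log. The sketch below is a minimal illustration under stated assumptions: the log entries and reporting period are hypothetical, and MTBF is computed as total uptime divided by the number of outages, which is one common convention rather than a definition given in the text.

```python
from datetime import datetime, timedelta

# Hypothetical outage log for a 30-day reporting period: (start, restored).
period_start = datetime(2024, 1, 1)
period_end = datetime(2024, 1, 31)
outages = [
    (datetime(2024, 1, 1, 9, 20), datetime(2024, 1, 1, 10, 50)),
    (datetime(2024, 1, 16, 14, 0), datetime(2024, 1, 16, 14, 30)),
]


def service_metrics(period_start, period_end, outages):
    """Return per-outage durations, availability percentage, and MTBF."""
    period = period_end - period_start
    durations = [restored - start for start, restored in outages]
    downtime = sum(durations, timedelta())
    uptime = period - downtime
    availability_pct = 100 * (uptime / period)
    # Assumed convention: MTBF = total uptime / number of failures.
    mtbf = uptime / len(outages) if outages else None
    return durations, availability_pct, mtbf


durations, availability_pct, mtbf = service_metrics(period_start, period_end, outages)
for (start, _), duration in zip(outages, durations):
    print(f"outage starting {start}: {duration} loss of service")
print(f"availability: {availability_pct:.3f}%")
print(f"MTBF: {mtbf}")
```

Computing these figures independently from your own monitoring data gives a basis for checking the provider's reported numbers against the SLA.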
CCSP® is a registered mark of (ISC)²