RPO and RTO

Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are the objectives that Temporal strives to meet for data durability (recovery point) and service restoration (recovery time) in the event of cloud outages. These objectives are high-priority goals for Temporal Cloud, but they are not contractual commitments.

To achieve the lowest RPO and RTO, Temporal Cloud offers High Availability features that keep Workflows operational with minimal downtime.

When High Availability is enabled on a Namespace, the user chooses an "active" region (where processing happens) and a "replica" region (where processing will switch to in the event of a failure). If the active and replica are in the same cloud provider but different regions (e.g., AWS us-east-1 and AWS us-west-2), this is called Multi-region Replication. If the active and replica are in different cloud providers (e.g., AWS and GCP), this is called Multi-cloud Replication. If the active and replica are in the same region, this is called Same-region Replication. Temporal will always place the active and replica in different cells.

As Workflows progress in the active region, history events are asynchronously replicated to the replica. If an outage occurs in the active region or cell, Temporal Cloud will fail over to the replica so that existing Workflow Executions continue to run and new Executions can be started.

The Recovery Point Objective and Recovery Time Objective for Temporal Cloud depend on the type of outage and which High Availability feature your Namespace has enabled:

  1. Availability zone outage: Temporal Cloud Namespaces and data are always replicated across three availability zones. The failure of a single availability zone is handled automatically by Temporal Cloud behind the scenes, with no potential for data loss, and little-to-no observable downtime to the end user.
    1. All Namespaces: Zero RPO and near-zero RTO.
  2. Cell outage: Temporal Cloud implements a cell architecture. Each cell contains the software and services necessary to host a Namespace. Occasionally, the cell can experience an outage due to uncaught software bugs or sub-vendor outages.
    1. Namespaces without High Availability: TODO pending Engineering response
    2. Namespaces with High Availability and Temporal-initiated failovers enabled* (Same-region Replication, Multi-region Replication, or Multi-cloud Replication): 1-minute RPO and 20-minute RTO
  3. Regional outage: On rare occasions, an entire region within a cloud provider will be degraded. Since Namespaces depend on the cloud provider's infrastructure, Temporal Cloud is not immune to these outages.
    1. Namespaces without Multi-region Replication or Multi-cloud Replication enabled: 8-hour RPO* and Unbounded RTO (dependent on how long the cloud region takes to recover)
    2. Namespaces with Multi-region Replication or Multi-cloud Replication and Temporal-initiated failovers enabled*: 1-minute RPO and 20-minute RTO
  4. Cloud-wide outage: An entire cloud provider has an outage across most or all regions. Since cloud providers strive to keep cloud regions de-coupled, these are the rarest outages of all. Still, they have happened in the past.
    1. Namespaces without Multi-cloud Replication: 8-hour RPO* and Unbounded RTO (dependent on how long the cloud provider takes to recover)
    2. Namespaces with Multi-cloud Replication and Temporal-initiated failovers enabled*: 1-minute RPO and 20 minutes or less RTO

A few notes on these goals:

  • "8-hour RPO" for Namespaces without the appropriate High Availability feature: Historically, regional outages have not led to data corruption or permanent data loss in data systems that are replicated across three availability zones. All Namespace data was available once the outage ended; affected Namespaces observed a recovery point of "zero" (no data loss). However, as a precaution, Temporal backs up all Namespaces in rolling 4-hour windows. Should a future outage cause permanent data loss in the underlying data system, these backups would meet an 8-hour RPO.

  • "Temporal-initiated failovers": Also known as "automatic failovers," these failovers are initiated by Temporal's tooling and/or on-call engineers on Namespaces that have High Availability enabled. Temporal highly recommends keeping Temporal-initiated failovers enabled, which is the default for all Namespaces with High Availability features. Users can still trigger manual failovers on their Namespaces even if Temporal-initiated failovers are enabled. When Temporal-initiated failovers are disabled on a Namespace, Temporal's RTO for that Namespace is unbounded (it depends on how long the underlying outage lasts).

Minimizing the Recovery Point

Temporal has put extensive work into tools and processes that minimize the recovery point and achieve its RPO for Temporal-initiated failovers, including:

  • Best-in-class data replication technology that keeps the replica up to date with the active.

  • Monitoring, alerting, and internal SLOs on the replication lag across all Temporal Cloud Namespaces.

However, user actions on a Namespace can affect the recovery point. For example, suddenly "bursting" into much higher throughput than your Namespace has seen before could create a period of replication lag where the replica falls behind the active. For this reason, Temporal exposes the replication lag metric that you can monitor on your Namespace. This metric approximates the recovery point the Namespace would achieve in a "worst case" failure at that given moment. Temporal recommends monitoring your replication lag and alerting should it rise too high, e.g., above 1 minute.
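As a rough illustration of the alerting recommendation above, the following sketch decides whether to alert based on a series of lag readings. It does not call any Temporal API: `LagSample` and `should_alert` are hypothetical names, and how you fetch the replication lag metric is left to your own monitoring stack. The optional `sustained_s` parameter is an assumption added here to suppress alerts on brief spikes; with its default of 0, the check reduces to "the latest reading is above 1 minute."

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LagSample:
    """One reading of the replication lag metric (hypothetical shape)."""
    timestamp_s: float  # when the reading was taken, in Unix seconds
    lag_s: float        # replication lag at that time, in seconds


def should_alert(
    samples: Sequence[LagSample],
    threshold_s: float = 60.0,   # the "above 1 minute" example from the text
    sustained_s: float = 0.0,    # assumption: require the breach to last this long
) -> bool:
    """Return True if lag has been above threshold_s continuously, up to and
    including the newest sample, for at least sustained_s seconds."""
    if not samples:
        return False
    ordered = sorted(samples, key=lambda s: s.timestamp_s)
    latest = ordered[-1].timestamp_s
    breach_start = None
    # Walk backwards from the newest sample while lag stays above the threshold.
    for sample in reversed(ordered):
        if sample.lag_s <= threshold_s:
            break
        breach_start = sample.timestamp_s
    if breach_start is None:
        return False
    return latest - breach_start >= sustained_s


readings = [
    LagSample(0, 10), LagSample(30, 70), LagSample(60, 80),
    LagSample(90, 90), LagSample(120, 100),
]
print(should_alert(readings))                   # newest reading is above 60 s
print(should_alert(readings, sustained_s=120))  # breach has lasted only 90 s
```

Because the current lag approximates the recovery point of a worst-case failure at that moment, a threshold equal to your own tolerable recovery point is a natural starting value.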

Minimizing the Recovery Time

Temporal has put extensive work into tools and processes that minimize the recovery time and achieve its RTO for Temporal-initiated failovers, including:

  • History events are replicated asynchronously. This ensures that the Namespace can still run Workflows in the active region even if there are networking blips or outages in the replica region.

  • Outages are detected automatically. We have extensive internal alerting to detect disruptions to Namespaces, and are ever improving this system.

  • Battle-tested Temporal Workflows that execute failovers of all Temporal Cloud Namespaces in a given region quickly.

  • Regular drills where we fail over our internal Namespaces to test our tooling.

  • Expert engineers on-call 24/7 monitoring Temporal Cloud Namespaces and ready to assist should an outage occur.

To achieve the lowest possible recovery times, Temporal recommends that you (1) keep Temporal-initiated failovers enabled on your Namespace (the default), and (2) invest in a process to detect outages and trigger a manual failover. Users can trigger manual failovers on their Namespaces even if Temporal-initiated failovers are enabled. There are several benefits to combining a manual failover process with Temporal-initiated failovers:

  • You can detect outages that Temporal doesn't. In the cloud, regional outages never affect every service the same way. It's possible that Temporal--and the services it depends on--are unaffected by the outage, even while your Workers or other cloud infrastructure are disrupted. If you monitor each service in your critical path and alert on unusual

  • You can sequence your failovers in a particular order. Your cloud infrastructure probably contains more pieces than just your Temporal Namespace: Temporal Workers, compute pools, data stores, and other cloud services. If you fail over manually, you can choose the order in which these pieces switch to the replica region. You can then test that ordering with failover drills and ensure it executes smoothly without data consistency issues or bottlenecks.

  • You can fail over more aggressively than Temporal. While the 20-minute RTO should be sufficient for most use cases, some may strive to hit an even lower RTO. For workloads like high-frequency trading, auctions, or popular sporting events, an outage at the wrong time could cause tremendous lost revenue per minute. You can adopt a posture that fails over more eagerly than Temporal does. For example, you could trigger a manual failover at the first sign of a possible disruption, before it's known to be a true regional outage.

  • Even if you have robust tooling to detect an outage and trigger a failover, leaving Temporal-initiated failovers enabled provides a "safety net" in case your automation misses an outage. It also gives Temporal leeway to preemptively fail over your Namespace if we detect that it may be disrupted soon, e.g., by a rolling failure that has impacted other Namespaces but not yet yours.
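One way to picture an "eager" manual failover posture is a small policy that triggers after a configurable number of consecutive failed health probes. This is only a sketch under stated assumptions: `run_failover_policy` is a hypothetical helper, the probe results stand in for whatever health checks you run on your critical path, and `trigger_failover` is a placeholder callback for however you initiate a manual failover (that mechanism is not covered in this section).

```python
from typing import Callable, Iterable


def run_failover_policy(
    probe_results: Iterable[bool],         # True = healthy probe, False = failed probe
    trigger_failover: Callable[[], None],  # hypothetical hook for your manual failover
    unhealthy_threshold: int = 3,          # consecutive failures tolerated before acting
) -> bool:
    """Call trigger_failover once after unhealthy_threshold consecutive failed
    probes. Return True if a failover was triggered, False otherwise."""
    consecutive_failures = 0
    for healthy in probe_results:
        # Any healthy probe resets the streak; a failed probe extends it.
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= unhealthy_threshold:
            trigger_failover()
            return True
    return False


events = []
probes = [True, False, False, True, False, False, False, True]

# A cautious posture waits for three consecutive failures.
triggered = run_failover_policy(probes, lambda: events.append("failover"), 3)
print(triggered, events)

# An eager posture acts on the first failed probe.
print(run_failover_policy([True, False], lambda: events.append("eager"), 1))
```

Lowering `unhealthy_threshold` trades more false-positive failovers for a shorter recovery time, which is the trade-off the "fail over more aggressively" bullet describes. Temporal-initiated failovers can stay enabled alongside a policy like this.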

Understanding Temporal's RTO vs. SLA

Temporal has both a Recovery Time Objective (RTO) and a Service Level Agreement (SLA). They serve complementary purposes and apply in different situations.

| Aspect | RTO | SLA |
| --- | --- | --- |
| What is it? | An objective, or high-priority goal, for the total time that an outage disrupts a Namespace. | A contractual agreement that sets an upper bound on the service error rate, with financial repercussions. |
| How is it measured? | The achieved "recovery time" is measured in terms of "minutes per outage." | The achieved "service error rate" is measured in terms of "error rate per month." |
| How is the calculation performed? | The achieved recovery time in a given outage is the total time between <when a disruption to a Namespace began> and <when the Namespace was restored to full functionality>, either after a failover to a healthy region or after the outage has been mitigated. | Temporal measures the percentage of requests to Temporal Cloud that fail, and applies a formula to get the final percentage for the month. |
| Do partial degradations count? | Most outages contain periods of partial degradation where some % of Namespace operations fail while the rest complete as normal. When they disrupt a Namespace, periods of partial degradation count in the calculation of the recovery time. | Partial degradations only partially count for the service error rate calculation. A 5-minute window with a 10% error rate would count less than a 5-minute window with a 100% error rate. |
| What is excluded? | For partial degradations, what counts as a "disruption to a Namespace" is subject to Temporal's expert judgment, but a good rule of thumb is a service error rate >=10%. | We exclude outages that are out of Temporal's control to mitigate, e.g., a failure of the underlying cloud provider infrastructure that affects a Namespace without High Availability and Temporal-initiated failovers enabled. If a Namespace has the relevant High Availability feature and has Temporal-initiated failovers enabled, then Temporal can act to mitigate the outage and it does usually count against the SLA. See the SLA page for the full list of exclusions. |
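To make the RTO side of the comparison concrete, the sketch below estimates an achieved recovery time from per-window error rates, using the ">=10%" rule of thumb as the definition of a disrupted window. This is an approximation only: the function name, the 5-minute window size, and the hard threshold are assumptions for illustration, and in practice what counts as a disruption is subject to Temporal's judgment.

```python
from datetime import timedelta
from typing import Sequence

WINDOW = timedelta(minutes=5)   # assumed measurement window
DISRUPTION_THRESHOLD = 0.10     # rule of thumb: error rate >= 10%


def recovery_time(
    window_error_rates: Sequence[float],
    window: timedelta = WINDOW,
    threshold: float = DISRUPTION_THRESHOLD,
) -> timedelta:
    """Time from the start of the first disrupted window to the end of the
    last disrupted window. Returns zero if no window reaches the threshold."""
    disrupted = [i for i, rate in enumerate(window_error_rates) if rate >= threshold]
    if not disrupted:
        return timedelta(0)
    first, last = disrupted[0], disrupted[-1]
    return (last - first + 1) * window


# Five consecutive 5-minute windows: 2% errors, then 10%, 40%, 100%, then recovered.
print(recovery_time([0.02, 0.10, 0.40, 1.0, 0.0]))  # 0:15:00
```

Note that the partially degraded windows count in full toward the recovery time here, whereas they would only partially count toward the service error rate.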

The following examples illustrate the RTO and SLA calculations for different Namespaces in a regional outage. These hypothetical Namespaces are based on actual Temporal Cloud performance in a real-world outage.

Suppose that region middle-earth-1 experienced a cascading failure starting at 10:00:00 UTC, causing various instances and machines to fail over time. Temporal's automatic failover triggered for all Namespaces and completed at 10:15:00 UTC.

  • Namespace 0 was in the region but its cell was not affected by the outage. The only downtime it had was for a few seconds during the failover operation. It experienced a near-zero Recovery Time, and its service error rate was negligible.

  • Namespace 1_A was in the region and its cell experienced a partial degradation that caused 10% of requests to fail in the first 5 minutes, 25% in the second five minutes, and 50% in the third five minutes. Since it was significantly impacted from 10:00:00 to 10:15:00, its Recovery Time was 15 minutes. If it had no other service errors that month, then its service error rate for the month would be: ( (1 - 10%) + (1 - 25%) + (1 - 50%) + 8925 * 100% ) / 8928 = 99.990%. (Note: there are 8928 5-minute periods in a 31-day month.)

  • Namespace 1_B was in the same cell as Namespace 1_A, so it also experienced a partial degradation that caused 10% of requests to fail. However, its owner detected the outage via their own tooling and decided to manually fail over at 10:05:00. This Namespace achieved a recovery time of 5 minutes and a service error rate of ( 1 * (1 - 10%) + 8927 * 100% ) / 8928 = 99.999%.

  • Namespace 2_A was in the region and its cell was fully network partitioned at the start of the outage, causing 100% of requests to fail. Since it was significantly impacted from 10:00:00 to 10:15:00, its Recovery Time was 15 minutes. If it had no other service errors that month, then its service error rate for the month would be: ( 3 * (1 - 100%) + 8925 * 100% ) / 8928 = 99.97%.

  • Namespace 2_B was in the region and was fully network partitioned, causing 100% of requests to fail. However, its owner detected the outage via their own tooling and decided to manually fail over at 10:05:00. This Namespace achieved a recovery time of 5 minutes and a service error rate of ( 1 * (1 - 100%) + 8927 * 100% ) / 8928 = 99.99%.
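The monthly figures in these examples can be reproduced with a few lines of arithmetic. The sketch below mirrors only the simplified calculation used in the examples (each 5-minute period contributes its fraction of successful requests, averaged over a 31-day month); it is not the formula from the SLA itself, and `monthly_success_percentage` is a hypothetical helper name.

```python
from typing import Sequence

PERIODS_IN_31_DAY_MONTH = 31 * 24 * 12  # 8928 five-minute periods


def monthly_success_percentage(
    degraded_error_rates: Sequence[float],
    periods_in_month: int = PERIODS_IN_31_DAY_MONTH,
) -> float:
    """Average per-period success rate over the month, as a percentage.
    degraded_error_rates lists the error rate (0.0 to 1.0) of each degraded
    5-minute period; every other period is assumed to be error-free."""
    if len(degraded_error_rates) > periods_in_month:
        raise ValueError("more degraded periods than periods in the month")
    degraded_total = sum(1.0 - rate for rate in degraded_error_rates)
    clean_periods = periods_in_month - len(degraded_error_rates)
    return 100.0 * (degraded_total + clean_periods) / periods_in_month


# Namespace 1_A: 10%, 25%, then 50% of requests failed over three periods.
print(f"{monthly_success_percentage([0.10, 0.25, 0.50]):.3f}%")  # 99.990%

# Namespace 2_A: 100% of requests failed for three periods.
print(f"{monthly_success_percentage([1.0, 1.0, 1.0]):.2f}%")     # 99.97%
```

Passing a different `periods_in_month` (for example 30 * 24 * 12 for a 30-day month) shifts the results only slightly, since a 15-minute outage is a small fraction of either month.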

All of the above Namespaces were in the affected region, but they achieved varying recovery times and service error rates.

  • Notice how Namespace 1_A and Namespace 2_A were both automatically failed over with the same recovery time but different service error rates. Notice how Namespace 2_B and Namespace 1_A happen to have the same service error rate but different recovery times. This illustrates how RTO and SLA can differ, even in the same outage. Both are valuable tools for Temporal Cloud users to measure the availability of their Namespaces.

  • Notice how the Namespaces that were manually failed over (Namespace 1_B and Namespace 2_B) achieved lower recovery times than the Namespaces that were automatically failed over (Namespace 1_A and Namespace 2_A). This illustrates how proactive, aggressive manual failover can achieve a better recovery time than automatic failover.