If you’re a software developer working on any kind of cloud platform, it’s highly likely that someone has already mentioned the CAP theorem. If they haven’t yet, they will soon.
The CAP theorem is also known as Brewer’s theorem, after the computer scientist Eric Brewer.
Essentially, the CAP theorem states that it’s impossible for a distributed data store to simultaneously provide more than two of the following three guarantees:
· Consistency
· Availability
· Partition tolerance
Now, a lot has been written about the CAP theorem, and we’re not going to go into too much depth, but it’s important to understand what this theorem is, as it will allow you to recognize some of the constraints and trade-offs that you’re going to encounter when designing your cloud system.
So let’s break down those three guarantees.
So the C in C-A-P stands for Consistency. By consistency, we mean that every read operation on the data store will return the most recent data.
Imagine an airline application where a user checks in for their flight and changes their seat. In a consistent system, as soon as any other user requests the seat map details, even a few milliseconds later, they’re going to see that seat as unavailable; they will see the most recent data.
Now, if you’re familiar with ACID transactions in a relational database, there’s some crossover here with this idea of consistency, though they’re not exactly the same thing. Indeed, we might use database transactions to support the C in the CAP theorem.
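The seat-map idea can be sketched in a few lines of Python. This is a toy model, not a real data store (the class and seat names are invented): writes are applied synchronously to every replica before returning, so every read sees the latest data.

```python
# Toy sketch of read-after-write consistency with synchronous replication:
# a write is not "done" until every replica has it, so a read from any
# node returns the most recent seat map.
class Node:
    def __init__(self):
        self.seats = {}

class ConsistentCluster:
    def __init__(self, replicas=2):
        self.nodes = [Node() for _ in range(replicas)]

    def assign_seat(self, seat, passenger):
        # Synchronous replication: apply to all nodes before returning.
        for node in self.nodes:
            node.seats[seat] = passenger

    def is_available(self, seat, node_index=0):
        # Any node returns the most recent data.
        return self.nodes[node_index].seats.get(seat) is None

cluster = ConsistentCluster()
cluster.assign_seat("12A", "alice")
print(cluster.is_available("12A", node_index=1))  # False on every replica
```

A real system pays for this guarantee in latency: the write blocks until all replicas acknowledge it.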
The A in CAP stands for availability. When we’re talking about availability within CAP theorem, we say that every request must receive a response.
Now, we know we can never achieve perfect availability, but if we follow best practices we can build availability into our system from the start, and we might then hope to achieve a system where, for example, 99.9% of requests will receive a response.
Now, with data stores in the cloud such as Azure SQL and Cosmos DB, high availability is built right in, but the design of your architecture is still important. What happens if a data center goes offline? Is that data replicated to another region?
The P in CAP stands for partition tolerance. You see the word partition crop up a lot in computer science, but we’re not talking about a data partition or a disk partition here. In the CAP theorem, the partition is a network partition: a split between two parts of a system that causes those parts to become unable to reach each other. In some ways, this isn’t dissimilar to the idea of graceful degradation, where we expect a system to remain responsive even when features we’d hope for aren’t there.
Imagine we have a system that’s composed of four components: A, B, C and D.
Let’s say that nodes A and B can communicate, and nodes C and D can communicate, but there is a problem stopping A and B from talking to C and D. If we are partition tolerant, then A and B, and C and D, will still continue to operate and do their work as normal.
While CAP theorem is specifically talking about network partitions, we’re not just talking about network-level errors here. We’re talking about any kind of failure that takes a component offline such as hardware failures or software maintenance.
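A toy simulation of this A/B vs. C/D split might look like the following (the node names come from the example above; the replication scheme is invented for illustration). Each side of the partition keeps accepting writes, which is exactly what partition tolerance asks for:

```python
# Toy sketch of partition tolerance: four nodes split into two islands
# ({A, B} and {C, D}) keep serving requests on their own side.
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.peers = set()  # names of nodes this node can still reach

    def write(self, key, value, cluster):
        self.data[key] = value
        # Replicate only to peers that are reachable.
        for peer in self.peers:
            cluster[peer].data[key] = value

cluster = {name: Node(name) for name in "ABCD"}
# Network partition: A<->B can talk, C<->D can talk, but not across.
cluster["A"].peers = {"B"}; cluster["B"].peers = {"A"}
cluster["C"].peers = {"D"}; cluster["D"].peers = {"C"}

cluster["A"].write("x", 1, cluster)  # lands on A's side only
cluster["C"].write("y", 2, cluster)  # lands on C's side only

print(cluster["B"].data)  # {'x': 1}: B's side kept working
print(cluster["D"].data)  # {'y': 2}: so did D's side
```

Notice the cost: after the partition, the two islands hold different data, which is the consistency trade-off the rest of this article explores.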
Ok now what?
So now we know what the C, A and P refer to, but that’s only half of the story.
DATABASE TRIGGERS
Trigger: A trigger is a stored procedure in a database that is automatically invoked whenever a particular event occurs in the database. For example, a trigger can be invoked when a row is inserted into a specified table or when certain table columns are updated.
Triggers can be run BEFORE or AFTER the triggering statement.
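As a brief sketch (using SQLite through Python’s sqlite3 module; the `orders` and `audit_log` tables are invented for illustration), an AFTER INSERT trigger that writes an audit row automatically might look like this:

```python
# An AFTER INSERT trigger in SQLite: every insert into `orders`
# automatically appends a row to `audit_log`.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE audit_log (order_id INTEGER, note TEXT);

-- Fires automatically AFTER each insert into orders.
CREATE TRIGGER log_order AFTER INSERT ON orders
BEGIN
    INSERT INTO audit_log VALUES (NEW.id, 'order created');
END;
""")

conn.execute("INSERT INTO orders (amount) VALUES (9.99)")
print(conn.execute("SELECT * FROM audit_log").fetchall())  # [(1, 'order created')]
```

Changing `AFTER` to `BEFORE` would run the trigger body before the triggering insert instead.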
DATABASE NORMALIZATION
Normalization is the process of structuring a database, following a series of Normal Forms, to reduce data redundancy and improve data integrity.
Without normalization in SQL, we may face issues such as:
- Insertion anomaly: occurs when we cannot insert data into the table without the presence of another attribute.
- Update anomaly: a data inconsistency that results from data redundancy and a partial update of data.
- Deletion anomaly: occurs when certain attributes are lost because of the deletion of other attributes.
Some examples of Normal Forms are 1NF, 2NF, 3NF, BCNF, etc.
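To make the anomalies concrete, here is a minimal toy illustration (with invented teacher and course data) of a deletion anomaly, and how a normalized design avoids it:

```python
# Deletion anomaly in an unnormalized table: teacher details are repeated
# on every course row, so deleting a teacher's last course also deletes
# the only record of that teacher.
unnormalized = [
    {"course": "Math", "teacher": "Smith", "teacher_phone": "555-1"},
    {"course": "Art",  "teacher": "Jones", "teacher_phone": "555-2"},
]

# Delete Jones's only course -> Jones's phone number is lost entirely.
unnormalized = [r for r in unnormalized if r["course"] != "Art"]
print(any(r["teacher"] == "Jones" for r in unnormalized))  # False: anomaly

# Normalized (3NF-style): teachers live in their own table, so deleting
# a course no longer destroys teacher data.
teachers = {"Smith": "555-1", "Jones": "555-2"}
courses = [{"course": "Math", "teacher": "Smith"}]
print("Jones" in teachers)  # True: no anomaly
```

The same decomposition also removes the update anomaly: a teacher’s phone number now lives in exactly one row.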
DATABASE ISOLATION LEVELS
Database isolation levels control the degree of locking that occurs when selecting data. This prevents reads and writes of temporary, aborted, or otherwise incorrect data written by concurrent transactions. For many database applications, the majority of transactions can be constructed so that they don’t require high isolation levels, reducing the locking overhead for the system.
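As a small illustration (using SQLite through Python’s sqlite3 module; the table name `t` is invented), one transaction’s uncommitted write is invisible to a second connection until commit:

```python
# Transaction isolation in SQLite: a second connection cannot see
# another transaction's uncommitted write.
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path)
reader = sqlite3.connect(path)

writer.execute("CREATE TABLE t (v INTEGER)")
writer.commit()

writer.execute("INSERT INTO t VALUES (1)")   # open transaction, not committed
print(reader.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 0: isolated

writer.commit()
print(reader.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 1: visible now
```

Stronger isolation levels extend this idea to whole sequences of statements, at the cost of more locking and contention.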
CAP Theorem: Scenarios explained
The CAP theorem applies to the behavior exhibited by a distributed system with respect to three attributes: Consistency, Availability and Partition tolerance.
The theorem describes the choice of behavior that the system can exhibit from the end user’s perspective. It states that, at any given point in time, only two of the three behaviors can be guaranteed.
Let’s look at what each of the three behaviors individually means:
- Consistency: The system shall always respond to the end user’s read request with consistent data.
- Availability: The system shall always respond to a user’s request to read/write.
- Partition Tolerance: The system will function even if the communication between the nodes drops, i.e. partitions are created within the network. If this behavior is acceptable, then the system is tolerant of partitions that cannot interact with each other, and it can continue to function with some sacrifices in quality of service.
For the below scenarios, let us consider that there are just two nodes which can hold some data. Node A is responsible for both read and write, whereas Node B is designed to be only read from.
Let’s take a look at the three possible combinations, starting with the simplest.
Combination 1: Consistency & Availability.
In this combination, it is expected that the system is both consistent and available. It also means the system depends on inter-node communication; any partition in the network is not tolerated.
When writes happen on Node A, all reads are supposed to get back the latest data, consistently, per the definition of consistency. The system is also always available: all read/write requests are honored without errors.
Combination 2 & 3:
These combinations are possible when the system is expected to function even when network partitions occur and the system can tolerate them, i.e. even if the nodes are unable to communicate, the system will function.
Combination 2: Consistency and Partition tolerance:
When there are two or more partitions in the system, the system can be designed to make one cluster/partition the primary and keep the other nodes/partitions dormant.
The primary cluster/partition is active and responsible for all data reads and writes. When read requests are serviced by this primary node, the data is guaranteed to be the latest and consistent. However, if read/write requests happen to go to nodes that are in the dormant state, the system may not respond; in other words, availability is not guaranteed.
When the network is partitioned, then in order to keep the system consistent, partition 1 is made primary and the others are made dormant. So any read request to the system will get consistent data (if served by Node 1) or no data at all (if served by Node 2).
Combination 3: Availability and Partition tolerance:
When there are two or more partitions in the system, the system can be designed to function as it did earlier: the nodes that were responsible for reads and writes continue to do that, and the nodes that were used for reads continue to respond to end users’ read requests. So the end user will always perceive the system as available; however, what a read returns depends heavily on the node the request goes to. As there is no inter-node communication, the data may be out of sync, and the end user may perceive the system as inconsistent.
When the network is partitioned, in order to keep the system available, any read request is serviced irrespective of whether the data is stale. So depending on the node that services the request, the data may not be perceived as consistent.
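A minimal sketch of the primary/dormant CP design described above, assuming a dormant node simply refuses reads rather than risk serving stale data:

```python
# Toy CP behavior during a partition: the primary serves consistent
# reads; dormant nodes refuse to answer, sacrificing availability.
class Unavailable(Exception):
    pass

class CPNode:
    def __init__(self, primary):
        self.primary = primary
        self.data = {}

    def read(self, key):
        if not self.primary:
            raise Unavailable("dormant node: refusing possibly stale read")
        return self.data[key]

node1 = CPNode(primary=True)
node2 = CPNode(primary=False)   # made dormant when the partition occurred
node1.data["seat"] = "taken"

print(node1.read("seat"))       # 'taken': consistent
try:
    node2.read("seat")
except Unavailable as err:
    print(err)                  # availability is not guaranteed
```

Which node a client happens to reach therefore determines whether it gets an answer at all, which is exactly the availability sacrifice CP makes.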
Note on consistency:
The definition of consistency still depends on the implementation designed into the system, e.g. eventual, session, or strong consistency. For the purpose of the above discussion, if we say the system is consistent, then it is bounded by the design of consistency in the system.
What is partition tolerance (in depth)?
A partition is the inability of two or more nodes in the network to communicate with each other.
If node 1 cannot receive messages from node 2, there is a partition between them, and messages sent from one node will not reach the other.
Partition tolerance says that even though the connection between nodes is severed (i.e. there are partitions), the system should still be up and running.
What follows is a couple of different scenarios that demonstrate why a system can only have two out of the three properties.
Partition tolerant and available system
If we want the system to be partition tolerant and available, we will be sacrificing consistency.
We call the write("Bye") operation and the call ends up on node 1.
Node 1 tells the other available nodes to update their state; however, nodes 2 and 4 cannot receive updates (there’s a partition). Those nodes are not aware of the updated state and keep the old value (Hello).
Later, if we call the read operation and the call ends up on either node 2 or 4, we get back the value of Hello. Since this is not the most recent state (as per the consistency definition), the system is inconsistent, but partition tolerant and available.
Partition tolerant and consistent system
The second combination we are looking at is a partition tolerant and consistent system. In this case, we are losing the availability property.
We have the exact same scenario as before. We update node 1 and the value gets replicated to node 3, but not on nodes 2 and 4 as there’s a partition.
What is going to happen when the read operation is invoked and the call ends up on node 2, for example?
Node 2 cannot return any value, because it’s bound by consistency, which says that every node provides the most recent state and never returns an outdated state. The way a node knows whether it has an outdated value is to ask the other nodes for the updated state. If it doesn’t hear back from all nodes, it assumes its value is outdated and won’t return it.
Consistent and available system
In the final scenario we want our system to be consistent and available.
Same flow as before: we update node 1, the values get replicated to all nodes in the cluster. Later, we try to read from node 2 and we get the latest value.
Everything looks good!
However, as soon as there’s a network partition, nothing works anymore and we are forced into one of the previous two situations.
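The scenarios above hinge on what a cut-off node does with a read. Here is a toy sketch of the AP choice from the write("Bye") walkthrough: nodes 2 and 4 never receive the update but answer reads anyway (the four-node layout mirrors the text; the dict-based state is invented for illustration):

```python
# AP during a partition: four replicas start with "Hello"; the update
# only reaches node 3, yet every node keeps answering reads.
state = {1: "Hello", 2: "Hello", 3: "Hello", 4: "Hello"}
reachable_from_1 = {3}          # partition: nodes 2 and 4 are cut off

def write(value):
    state[1] = value
    for node in reachable_from_1:
        state[node] = value     # nodes 2 and 4 never hear about it

def read(node):
    return state[node]          # AP: answer even if possibly stale

write("Bye")
print(read(3))  # 'Bye'  : replicated
print(read(2))  # 'Hello': stale, but the system stayed available
```

Making this CP instead would mean `read(2)` refuses to answer until node 2 can confirm its state with the others.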
CA databases enable consistency and availability across all nodes. Unfortunately, CA databases can’t deliver partition tolerance. In any distributed system, partitions are bound to happen, which means this type of database isn’t a very practical choice. That being said, you can still find a CA database if you need one. Some relational databases, such as PostgreSQL, allow for consistency and availability. You can deploy them to nodes using replication.
CP databases enable consistency and partition tolerance, but not availability. When a partition occurs, the system has to turn off inconsistent nodes until the partition can be fixed. MongoDB is an example of a CP database. It’s a NoSQL database management system (DBMS) that uses documents for data storage. It’s considered schema-less, which means that it doesn’t require a defined database schema. It’s commonly used in big data and applications running in different locations. The CP system is structured so that there’s only one primary node that receives all of the write requests in a given replica set. Secondary nodes replicate the data in the primary nodes, so if the primary node fails, a secondary node can stand in.
AP databases enable availability and partition tolerance, but not consistency. In the event of a partition, all nodes are available, but they’re not all updated. For example, if a user tries to access data from a bad node, they won’t receive the most up-to-date version of the data. When the partition is eventually resolved, most AP databases will sync the nodes to ensure consistency across them. Apache Cassandra is an example of an AP database. It’s a NoSQL database with no primary node, meaning that all of the nodes remain available. Cassandra allows for eventual consistency because users can resync their data right after a partition is resolved.
CAP theorem and microservices
Microservices are defined as loosely coupled services that can be independently developed, deployed, and maintained. They include their own stack, database, and database model, and communicate with each other through a network. Microservices have become especially popular in hybrid cloud and multi-cloud environments, and they are also widely used in on-premises data centers. If you want to create a microservices application, you can use the CAP theorem to help you determine a database that will best fit your needs.
When should Consistency or Availability be prioritized?
If you’re working with data that you know needs to be up-to-date, then it may be better to store it in a database that prioritizes consistency over availability. On the other hand, if it’s fine that the queried data can be slightly out-of-date, then storing it in an available database may be the better choice.
Notice that only write requests were discussed above. This is because read requests don’t affect the state of the data, and don’t require re-syncing between nodes. Read requests are typically fine during network partitions for both consistent and available databases.
Does Consistency in CAP mean Strong Consistency?
In a strongly consistent database, if data is written and then immediately read after, it should always return the updated data. The problem is that in a distributed system, network communication doesn’t happen instantly, since nodes/servers are physically separated from each other and transferring data takes >0 time. This is why it’s not possible to have a perfectly, strongly consistent distributed database. In the real world, when we talk about databases that prioritize consistency, we usually refer to databases that are eventually consistent, with a very short, unnoticeable lag time between nodes.
ACID, BASE, and CAP
ACID and BASE represent two design philosophies at opposite ends of the consistency-availability spectrum. The ACID properties focus on consistency and are the traditional approach of databases. My colleagues and I created BASE in the late 1990s to capture the emerging design approaches for high availability and to make explicit both the choice and the spectrum. Modern large-scale wide-area systems, including the cloud, use a mix of both approaches.
Although both terms are more mnemonic than precise, the BASE acronym (being second) is a bit more awkward: Basically Available, Soft state, Eventually consistent. Soft state and eventual consistency are techniques that work well in the presence of partitions and thus promote availability.
The relationship between CAP and ACID is more complex and often misunderstood, in part because the C and A in ACID represent different concepts than the same letters in CAP and in part because choosing availability affects only some of the ACID guarantees. The four ACID properties are:
Atomicity (A). All systems benefit from atomic operations. When the focus is availability, both sides of a partition should still use atomic operations. Moreover, higher-level atomic operations (the kind that ACID implies) actually simplify recovery.
Consistency (C). In ACID, the C means that a transaction preserves all the database rules, such as unique keys. In contrast, the C in CAP refers only to single-copy consistency, a strict subset of ACID consistency. ACID consistency also cannot be maintained across partitions; partition recovery will need to restore ACID consistency. More generally, maintaining invariants during partitions might be impossible, thus the need for careful thought about which operations to disallow and how to restore invariants during recovery.
Isolation (I). Isolation is at the core of the CAP theorem: if the system requires ACID isolation, it can operate on at most one side during a partition. Serializability requires communication in general and thus fails across partitions. Weaker definitions of correctness are viable across partitions via compensation during partition recovery.
Durability (D). As with atomicity, there is no reason to forfeit durability, although the developer might choose to avoid needing it via soft state (in the style of BASE) due to its expense. A subtle point is that, during partition recovery, it is possible to reverse durable operations that unknowingly violated an invariant during the operation. However, at the time of recovery, given a durable history from both sides, such operations can be detected and corrected. In general, running ACID transactions on each side of a partition makes recovery easier and enables a framework for compensating transactions that can be used for recovery from a partition.
In its classic interpretation, the CAP theorem ignores latency, although in practice, latency and partitions are deeply related. Operationally, the essence of CAP takes place during a timeout, a period when the program must make a fundamental decision, the partition decision:
- cancel the operation and thus decrease availability, or
- proceed with the operation and thus risk inconsistency.
Retrying communication to achieve consistency, for example, via Paxos or a two-phase commit, just delays the decision. At some point the program must make the decision; retrying communication indefinitely is in essence choosing C over A.
Thus, pragmatically, a partition is a time bound on communication. Failing to achieve consistency within the time bound implies a partition and thus a choice between C and A for this operation. These concepts capture the core design issue with regard to latency: are two sides moving forward without communication?
This pragmatic view gives rise to several important consequences. The first is that there is no global notion of a partition, since some nodes might detect a partition and others might not. The second consequence is that nodes can detect a partition and enter a partition mode, a central part of optimizing C and A.
Finally, this view means that designers can set time bounds intentionally according to target response times; systems with tighter bounds will likely enter partition mode more often and at times when the network is merely slow and not actually partitioned.
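The timeout-driven partition decision described above can be sketched as follows (the replica call is a stand-in for a real network operation, and all names and timings are invented):

```python
# The "partition decision": bound the wait for replica acknowledgement;
# on timeout, either cancel the operation (choose C) or proceed without
# the replica (choose A).
import time

def replicate(value, replica_delay, timeout):
    start = time.monotonic()
    # Stand-in for a network call: the replica "responds" after
    # replica_delay seconds; we give up after `timeout` seconds.
    while time.monotonic() - start < timeout:
        if time.monotonic() - start >= replica_delay:
            return "acked"
        time.sleep(0.001)
    return "timeout"

def write(value, replica_delay, timeout, prefer="C"):
    if replicate(value, replica_delay, timeout) == "acked":
        return "committed consistently"
    # Timeout => treat it as a partition and make the C-vs-A choice.
    return "cancelled (chose C)" if prefer == "C" else "committed locally (chose A)"

print(write("x", replica_delay=0.0, timeout=0.05))              # committed consistently
print(write("x", replica_delay=1.0, timeout=0.05, prefer="C"))  # cancelled (chose C)
print(write("x", replica_delay=1.0, timeout=0.05, prefer="A"))  # committed locally (chose A)
```

Tightening `timeout` makes the system declare partitions more eagerly, exactly the trade-off between response time and partition-mode frequency described above.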
Sometimes it makes sense to forfeit strong C to avoid the high latency of maintaining consistency over a wide area. Yahoo’s PNUTS system incurs inconsistency by maintaining remote copies asynchronously. However, it makes the master copy local, which decreases latency. This strategy works well in practice because single user data is naturally partitioned according to the user’s (normal) location. Ideally, each user’s data master is nearby.
Facebook uses the opposite strategy: the master copy is always in one location, so a remote user typically has a closer but potentially stale copy. However, when users update their pages, the update goes to the master copy directly as do all the user’s reads for a short time, despite higher latency. After 20 seconds, the user’s traffic reverts to the closer copy, which by that time should reflect the update.
Choosing system needs
To some, the choice between consistency and availability is really a matter of philosophical discussion that’s rarely made in practice. The reliability of these distributed systems is pretty good. That said, problems do happen. AWS experienced a big outage just before Thanksgiving 2020.
Where the theory says you can have only two of three components, professionals say that’s not always the case. Eric Brewer, the computer scientist who originally posited the CAP theorem, cleared up some confusion around it, generalizing it from a hard either/or statement to one depending on the system’s needs. He said:
“The modern CAP goal should be to maximize combinations of consistency and availability that make sense for the specific application. Such an approach incorporates plans for operation during a partition and for recovery afterward, thus helping designers think about CAP beyond its historically perceived limitations.”
The choice between consistency and availability comes up when choosing which database type to go with, such as SQL vs. NoSQL. NoSQL databases can be classified based on whether they support high availability or high consistency.
SQL databases like MySQL, PostgreSQL, Microsoft SQL Server, Oracle, etc. usually prioritize consistency. Master-slave replication is a common distributed architecture in SQL databases, and in the event of a master becoming unavailable, the role of master would fail over to one of the replica nodes. During this failover process and the election of a new master node, the database cannot be written to, so that consistency is preserved.
Popular Databases that Prioritize Consistency vs. Availability
NoSQL databases do not require a schema and don’t enforce relations between tables. In document-oriented NoSQL databases, records are JSON documents, which are complete entities one can readily read and understand. NoSQL databases are widely recognized for:
- Scalable performance
- Strong resilience
- Wide availability
Examples of NoSQL databases include:
- Cloud Firestore
- Firebase Real-time DB
- Amazon DynamoDB
Highly Available and Partition Tolerant systems
As mentioned before, AP systems will always (there is no 100% availability in IT, but you get the point) be there for you no matter what. The caveat with them is that the data may not always be updated everywhere.
Examples of these systems are:
- Cassandra DB. This is a distributed NoSQL database that can handle huge amounts of information. While it can be configured to be data-consistent, in doing so you’ll lose availability, shifting it from an AP system into a CP one. By default, its data consistency level puts it inside this category. Use it, as with many other systems in this category, if you don’t mind reading stale data in some situations.
- MemoryDB. AWS recently announced the launch of this Redis-based managed data store. It also has an eventual consistency model, replicating its transaction log across multiple availability zones. This is a great alternative if you’re looking for a reliable key-value store.
AP systems are strong and reliable, and they’re great options as long as you can deal with some stale data from time to time.
Consistent and Partition Tolerant databases
Now let’s talk about the other side of the coin (because remember, we’re leaving out CA systems here): CP systems.
These systems are not highly available (not by default at least), which leaves out any cloud-based version of them.
Examples of CP systems are:
- HBase. An open source NoSQL database modeled after Google’s Bigtable paper. It provides real-time access to big data (we’re talking billions of rows and records). By default, HBase is considered CP because it provides a very narrow and forced consistency model: every write operation for a given range of rows is routed through a single “RegionServer”. If that server fails, that part of the cluster becomes unreachable while it comes back online, which removes the “A” of availability. Like the other alternatives listed here, it can be configured to become highly available, at the expense of data consistency of course.
- MongoDB. Probably one of the most common NoSQL databases out there, MongoDB is a document-based database much like CouchDB (which also uses a JSON format for its records). However, unlike Couch, Mongo is considered CP by default, meaning that it prefers to stay data-consistent in a distributed environment (all read and write operations are by default served by the primary node of a replica set). This holds as long as a majority of the nodes of the replica set can reach each other; if not, no new primary can be elected and the set goes down.
MongoDB and the CAP theorem (CP)
MongoDB is a popular NoSQL database management system that stores data as BSON (binary JSON) documents. It’s frequently used for big data and real-time applications running at multiple different locations. Relative to the CAP theorem, MongoDB is a CP data store — it resolves network partitions by maintaining consistency, while compromising on availability.
MongoDB is a single-master system — each replica set can have only one primary node that receives all the write operations. All other nodes in the same replica set are secondary nodes that replicate the primary node’s operation log and apply it to their own data set. By default, clients also read from the primary node, but they can also specify a read preference that allows them to read from secondary nodes.
When the primary node becomes unavailable, the secondary node with the most recent operation log will be elected as the new primary node. Once all the other secondary nodes catch up with the new primary, the cluster becomes available again. As clients can’t make any write requests during this interval, the data remains consistent across the entire network.
Cassandra and the CAP theorem (AP)
Apache Cassandra is an open source NoSQL database maintained by the Apache Software Foundation. It’s a wide-column database that lets you store data on a distributed network. However, unlike MongoDB, Cassandra has a masterless architecture, and as a result there is no single point of failure that can take the whole system down.
Relative to the CAP theorem, Cassandra is an AP database — it delivers availability and partition tolerance but can’t deliver consistency all the time. Because Cassandra doesn’t have a master node, all the nodes must be available continuously. However, Cassandra provides eventual consistency by allowing clients to write to any nodes at any time and reconciling inconsistencies as quickly as possible.
Data only becomes inconsistent in the case of a network partition, and inconsistencies are quickly resolved; Cassandra offers “repair” functionality to help nodes catch up with their peers. Constant availability also results in a highly performant system that might be worth the trade-off in many cases.
RDBMS (MySQL, Oracle, MS SQL Server, etc.)
It’s a no-brainer that a single-node RDBMS is consistent, as all reads and writes go to a single node/server.
How about availability? You might say it’s one single server and hence a single point of failure, so how is it categorized under availability?
As I said earlier, CAP availability is not the same as the day-to-day availability/downtime we usually talk about. In a single-node system there will be no network partition; hence, if the node is up, it will always return success for any read/write operation, and is therefore available.
What happens when you replicate these Relational Databases?
We can build such replicated systems using a cluster manager such as ZooKeeper.
So does this mean these replicated relational databases are Available?
Not entirely, let’s see how.
- If a leader disconnects from the cluster, it takes a few seconds to elect a new leader, so it is definitely not an available system.
- A client can always be disconnected from the leader by a network partition, even if both the client and the leader node are running fine, making the system unavailable.
Note: The second point above could be solved if client applications also kept a heartbeat of the leader and initiated leader election when unable to connect to it. That is still not easy to achieve in an RDBMS :) It would just complicate client applications to put such logic in them.
What about consistency when data is replicated?
- If the data is read from and written to only the master/primary node, the system is always consistent.
- If read requests are sent to any of the secondaries, we will lose consistency and might serve inconsistent data during a network partition, or when the master takes time to replicate data.