Database at scale - Introduction

5 min read

What does running a database at scale mean, and how do we approach it across different database types?

This series will cover the basics of scaling databases. Below are the topics we’ll explore:

  1. Introduction - You're here

  2. Basic Replication

  3. Multiple Masters

  4. Sharding and Proxies

  5. NoSQL with CassandraDB/ScyllaDB

A database server is generally a remote endpoint that listens for incoming read or write requests. These requests can be made over various interfaces, such as a REST API or a native client library.

In standard database cluster configurations, there is usually a single source of truth—a server that stores the data and manages both read and write operations.

To maintain high availability in a database deployment, it's recommended to have at least three sources of truth. This way, if one server fails, the remaining servers can continue to operate without interruption and can still form a majority that agrees on the current state.

Terminology

When discussing databases at scale, we first need to get familiar with some new terminology, which this part covers:

Replication

Databases play a crucial role in managing data, but storing all your data in a single location can be risky. This is where replication becomes essential.

Replication acts as a secondary source of truth, ensuring that an exact copy of the original data is maintained elsewhere.

Standby Server

Any data written to the main database server is replicated to the standby server, either synchronously or asynchronously.

The standby server remains inactive until it is needed. If the main database server crashes, the standby server takes over and becomes the new main database server.

When the original server is restored, it assumes the role of the new standby server.

Master / Slave

A master database server handles both read and write requests, while a slave database server is configured to handle read requests only.

When new data is written to the master, it is replicated to the slaves using one of two methods (contrasted in the sketch after this list):

  • Synchronous replication: Data is committed on the master only after it has also been written to all slave instances, ensuring data consistency but potentially slowing down write operations.

  • Asynchronous replication: Data is committed to the master immediately, and replication to the slaves happens in the background. This speeds up write operations but may leave the slaves temporarily out of date.
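
To make the difference concrete, here is a minimal Python sketch of the two approaches. The `Primary` and `Replica` classes are purely illustrative; real databases implement replication internally.

```python
# A purely illustrative sketch contrasting synchronous and asynchronous
# replication; real databases implement this internally.
import threading


class Replica:
    def __init__(self, name):
        self.name = name
        self.rows = []

    def apply(self, row):
        # Apply a replicated row to this replica's local copy of the data.
        self.rows.append(row)


class Primary:
    def __init__(self, replicas, synchronous=True):
        self.replicas = replicas
        self.synchronous = synchronous
        self.rows = []

    def write(self, row):
        self.rows.append(row)  # always written locally first
        if self.synchronous:
            # Synchronous: only acknowledge once every replica has the row,
            # so all copies agree, at the cost of slower writes.
            for replica in self.replicas:
                replica.apply(row)
        else:
            # Asynchronous: acknowledge immediately and ship the row to the
            # replicas in the background; they may lag behind for a moment.
            for replica in self.replicas:
                threading.Thread(target=replica.apply, args=(row,)).start()
        return "ok"


primary = Primary([Replica("slave-1"), Replica("slave-2")], synchronous=True)
primary.write({"id": "az-123", "name": "example"})
```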

Multiple Masters

In this topology, each server can handle write requests, unlike the previous setup.

When a write request is made, it is propagated to the other servers either synchronously or asynchronously.

The key advantage of this topology is that any server can be used to write new data to the database, providing greater flexibility and convenience.

Sharding

While the previously mentioned solutions offer redundancy, they do not address a critical issue.

What happens when your data surpasses the capacity of a single server? Even with multiple slaves or masters, each holding the same data, you can only scale your database's read and write capabilities, not its storage capacity.

To resolve this limitation, you can implement sharding.

When sharding, each new row is directed to a specific database server based on predefined criteria.

In such a sharding topology, each database server contains identical tables with the same columns, but each one holds a different slice of the rows. For instance, a row whose ID starts with "az" is inserted into the first database server, while a read for an ID starting with "ar" is served by the last database server.

Proxy

Except for standby servers, which stay idle until a failover, you'll need a way to access all of the servers as if they were a single unit.

In a master-master topology, where all servers handle both read and write requests, a simple Layer 4 (L4) load balancer can evenly distribute the requests.

For master-slave and sharding topologies, you'll need a proxy. This can be implemented programmatically in code or through a dedicated service proxy.

Code-level

For example, you can encapsulate your database connector with a code wrapper that directs requests to the appropriate database server.

In a master-slave topology, you can place all the slaves behind a Layer 4 load balancer and split the traffic by read or write operation.
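
For instance, a minimal Python sketch of such a wrapper might look like the following; the DB-API-style `connect` function (for example, psycopg2.connect) and the hostnames are placeholders:

```python
# A sketch of a code-level proxy that splits reads and writes.
# `connect` is any DB-API-style connect function (e.g. psycopg2.connect);
# the hostnames are placeholders for the master and for the Layer 4
# load balancer sitting in front of the slaves.

WRITE_STATEMENTS = {"INSERT", "UPDATE", "DELETE", "CREATE", "ALTER", "DROP"}


class RoutingConnector:
    def __init__(self, connect):
        self.master = connect(host="db-master.internal")       # handles writes
        self.replicas = connect(host="db-slaves-lb.internal")  # handles reads

    def execute(self, query, params=()):
        # Writes go to the master; everything else goes to the slave pool.
        connection = self.master if self._is_write(query) else self.replicas
        cursor = connection.cursor()
        cursor.execute(query, params)
        return cursor

    @staticmethod
    def _is_write(query):
        return query.lstrip().split(None, 1)[0].upper() in WRITE_STATEMENTS


# Hypothetical usage:
# import psycopg2
# db = RoutingConnector(psycopg2.connect)
# db.execute("INSERT INTO users (name) VALUES (%s)", ("alice",))
# db.execute("SELECT * FROM users")
```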

In a sharding topology, you would instead inspect the ID of each request and route it to the correct server.
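
For example, here is a minimal sketch that reuses the "az" / "ar" prefix idea from above; the prefix-to-host mapping and the `connections` dictionary of open database connections are illustrative only:

```python
# A sketch of shard routing by ID prefix, mirroring the "az" / "ar" example.
# The prefix-to-host mapping and the `connections` dictionary are illustrative.

SHARD_MAP = {
    "az": "db-shard-1.internal",  # IDs starting with "az" live on the first server
    "ar": "db-shard-3.internal",  # IDs starting with "ar" live on the last server
}


def shard_for(record_id):
    # Return the host responsible for the given ID.
    for prefix, host in SHARD_MAP.items():
        if record_id.startswith(prefix):
            return host
    raise KeyError(f"no shard configured for ID {record_id!r}")


def execute_on_shard(connections, record_id, query, params=()):
    # `connections` maps hostnames to already-open database connections.
    cursor = connections[shard_for(record_id)].cursor()
    cursor.execute(query, params)
    return cursor
```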

Service-proxy

Using a code-level proxy can be beneficial but may be more challenging to maintain. For this reason, a service proxy can be a useful alternative.

Two popular service proxies are:

Mixing

It’s common to use a mix of the above topologies, for example:

  1. The first layer can be a sharding topology.

  2. Then, each shard is deployed using a master-master topology.

  3. Then, each of those masters has its own master-slave topology.

NoSQL vs YesSQL

While the approaches mentioned above require significant effort, they are typically associated with Relational Databases (YesSQL), as NoSQL databases often come with these functionalities built-in.

The reason lies in YesSQL's main advantage: joins.

In YesSQL, you don't write any data twice (at least, you should avoid it); instead, you create foreign keys that reference data in other tables.

However, a challenge arises when you scale your database across multiple servers. If a join needs rows stored on both server 1 and server 2, a single query fans out into multiple cross-server requests, and complex joins make this fan-out even worse.

To address this issue in YesSQL databases, strict sharding is essential: all related data, including the rows that foreign keys point to, must reside on the same physical server. However, this strict setup is not always achievable.
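
One common way to approximate it is to route every row by the same key, so that related rows land on the same shard and joins stay local. A minimal sketch, with made-up table names, hostnames, and a modulo-based shard choice:

```python
# A sketch of strict sharding: every row is routed by the same key (user_id),
# so a user's record and the orders that reference it land on the same server
# and the join below stays on one shard. Hosts, tables, and the modulo-based
# shard choice are all illustrative.

SHARDS = ["db-shard-1.internal", "db-shard-2.internal", "db-shard-3.internal"]


def shard_for_user(user_id):
    return SHARDS[user_id % len(SHARDS)]


def insert_user(connections, user_id, name):
    cursor = connections[shard_for_user(user_id)].cursor()
    cursor.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (user_id, name))


def insert_order(connections, user_id, amount):
    # Routed by user_id (the foreign key), not by the order's own ID,
    # so the order always lives next to the user it references.
    cursor = connections[shard_for_user(user_id)].cursor()
    cursor.execute("INSERT INTO orders (user_id, amount) VALUES (%s, %s)", (user_id, amount))


def user_with_orders(connections, user_id):
    # Both tables are on the same shard, so this join never crosses servers.
    cursor = connections[shard_for_user(user_id)].cursor()
    cursor.execute(
        "SELECT u.name, o.amount "
        "FROM users u JOIN orders o ON o.user_id = u.id "
        "WHERE u.id = %s",
        (user_id,),
    )
    return cursor.fetchall()
```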

Going forward

In the upcoming sections of the series, we will delve into the fundamentals of deploying these kinds of topologies.

To keep all the examples in the coming articles easy to follow, we'll use Docker and docker-compose.