System Design: Building & Scaling to 100 million Users

 

1 Single Server Setup......................................................................................................................... 2

1.1 Database.................................................................................................................................. 3

1.1.1 Database: SQL vs NoSQL..................................................................................................... 3

1.1.2 Data warehouse vs Data Lake.............................................................................................. 4

2 Understand bottlenecks.................................................................................................................. 5

3 Upscale: Multiple Server Setup........................................................................................................ 5

3.1 Vertical Scaling vs horizontal scaling........................................................................................... 5

3.1.1 Vertical Scaling.................................................................................................................... 5

3.1.2 Horizontal Scaling................................................................................................................ 6

3.2 Load balancer........................................................................................................................... 6

3.2.1 Load balancing algorithms................................................................................................... 7

3.3 Database.................................................................................................................................. 7

3.3.1 Database replication(Master/Slave).................................................................................... 7

3.3.2 Database partitioning/sharding........................................................................................... 8

3.4 Cache....................................................................................................................................... 8

3.4.1 Cache tier........................................................................................................................... 9

3.4.2 Considerations for using cache............................................................................................ 9

3.5 Content delivery network (CDN)............................................................................................... 9

3.6 Stateless web tier..................................................................................................................... 9

3.6.1 Stateful vs Stateless Architecture........................................................................................ 9

3.6.2 Stateless architecture....................................................................................................... 10

4 Global Multiple Data center........................................................................................................... 10

5 Traditional vs Cloud computing....................................................................................................... 11

5.1 Cost, Time, and Security.......................................................................................................... 11

5.2 Management cost................................................................................................................... 12

6 Message queue............................................................................................................................. 12

7 Logging, metric, automation........................................................................................................... 12

8 Summary....................................................................................................................................... 13

9 Reference..................................................................................................................................... 13

 

1 Single Server Setup

single server setup is where everything is running on one server: web server, app, database, cache, etc.

graphic

1.1 Database

1.1.1 Database: SQL vs NoSQL

- SQL(relational database) are the most popular database, SQL store data in tables and rows. We can perform join operations using SQL across different datable tables.

most popular SQL database are MySQL, Oracle database, PostgreSQL, etc.

 

- NoSQL(Non-relational database) group data into four categories: key-value stores, graph stores, column stores, and document stores. Join operations are generally not supported in non-relational databases.

most popular Non-relational database are CouchDB, Neo4j, Cassandra, HBase, Amazon DynamosDB, etc.

 

graphic

For most project, SQL database the best option because it been around for 40 years and it work well; However, No-SQL may be a good choice if:

- your application requires super-low latency.
- You need to store a massive amount of data.
- your data is unstructured, or you do not have any relational data.
- You only need serialize and deserialize data (XAML, JSON, YAML, etc)

graphic

 

1.1.2 Data warehouse vs Data Lake

 

DATA WAREHOUSE

DATA LAKE

DATA

Structured, processed

structured, semi-structured, unstructured, raw

SCALE

Scales to moderate volume at a high cost

Scales to huge volumes at low cost

PROCESSING

schema-on-write

schema-on-read

STORAGE

expensive for large data volumes

designed for low-cost storage

AGILITY

less agile, fixed configuration

Highly agile, configure and reconfigure as needed

SECURITY

mature

maturing

USERS

business professional

data scientist, data analytic

 

2 Understand bottlenecks

- How your system perform if they number of users increase by 10X, 100X, or 1000X?
- What if a component in your system failed?
- Is the database too slow and does it need some in-memory caching?
- Perhaps your system needs a load balancer and many machines behind it to handle the user request. Or maybe the data is so huge that you need to distribute your database on multiple machines. What are some of the downsides that occur from doing that?

3 Upscale: Multiple Server Setup

3.1 Vertical Scaling vs horizontal scaling

 - Vertical scaling is adding more power (CPU, RAM) to your existing machine, Horizontal scaling is add more machine into your pool of resources.

graphic

3.1.1 Vertical Scaling

Advantages of vertical scaling:

- Most of the software can easily take advantage of vertical scaling.
- Less power consumption than running multiple servers.
- Easy to manage and install hardware within a single machine.


Disadvantages of vertical scaling:

- Requires huge amount of financial investment.
- Greater risk of hardware failure causing bigger outages.
- Generally vendor lock-in and limited upgradeability in future.
- Low availability.
- There is hardware limit.

3.1.2 Horizontal Scaling

Advantages of horizontal scaling:

- Much lower cost than vertical scaling.
- Easier to run fault-tolerance.
- Ability to scale out as much as possible.
- High availability.


Disadvantages of horizontal scaling:

- Software has to handle all the data distribution and parallel.
- Limited number of software are available that can take advantage of horizontal scaling.
- Higher utility cost (electricity and cooling).

3.2 Load balancer

Load balancing is a technique to distributing incoming network traffic across a group of backend servers, also know as a server farm or server pool. The benefit of load balancing is reduced downtime, scalable, redundancy, flexibility, efficiency.

graphic

3.2.1 Load balancing algorithms

1) Round Robin: Requests are distributed across the group of servers sequentially.

2) Least Connections: A new request is sent to the server with the fewest current connections to clients. The relative computing capacity of each server is factored into determining which one has the least connections.

3) Least Time: Sends requests to the server selected by a formula that combines the

fastest response time and fewest active connections. Exclusive to NGINX Plus.

4) Hash: Distributes requests based on a key you define, such as the client IP address or

the request URL. NGINX Plus can optionally apply a consistent hash to minimize redistribution

of loads if the set of upstream servers changes.

5) IP Hash: The IP address of the client is used to determine which server receives the request.

6) Random with Two Choices: Picks two servers at random and sends the request to the

one that is selected by then applying the Least Connections algorithm (or for NGINX Plus

the Least Time algorithm, if so configured).

3.3 Database

3.3.1 Database replication(Master/Slave)

Database replication is create one or more copy of your database, so that all users share the same level of information. The result is a distributed database in which users can access data relevant to their tasks without interfering with the work of others.  The implementation of database replication for the purpose of elimination data ambiguity or inconsistency among users is known as normalization.
- All write actions (Create, Update, Delete) will be done to the "master", all the read actions will be done on the "slave".
- If any of "slave" go down, we can create new "slave" easily. If the "master" go down, one of the "slave" will promote to master.
graphic

3.3.2 Database partitioning/sharding

Partitioning of relational data usually refers to decomposing your tables either row-wise (horizontally) or column-wise (vertically).
graphic

3.4 Cache

Cache is the temporary storage area that stores the result of expensive responses or frequently accessed data in memory so that subsequent requests are served more quickly.

3.4.1 Cache tier

Cache tier is a temporary data store layer, much faster than the database, it help improve system performance and reduce database workloads.

3.4.2 Considerations for using cache

3.5 Content delivery network (CDN)

CDN is a network of geographically dispersed servers used to deliver static content, it cache static content like images, video, CSS, JavaScript files, etc.

Thing to consider when use CDN:

- Cost: CDN are run by third-party providers, and we are charged for data transfers in and out of the CDN. Caching infrequently used data may provide little to no benefits.
- Expiration: setting an appropriate cache expiry is important. if the cache expiry time is too long, the content no longer fresh, if the cache expiry time is too short, it can cause repeat load of content from server to CDN.
- CDN fallback: how our website deal with the failure of CDN's server, clients should be able to detect the problem and request resource from original server.

3.6 Stateless web tier

State is the user session data, by move the user session data out of the web tier and store it in persistent storage such as SQL or NoSQL, we allow each web server in the cluster to access state data. This is called "stateless web tier"

3.6.1 Stateful vs Stateless Architecture

A stateful server remembers client data from one request to the next. A stateless server keeps no state information.

in stateful server, router will route user A to server 1, user B to server 2, and user C to server 3. The issue is that every request from same client must be route to the same server, this can be done with sticky sessions in most load balances; however this create overhead, adding or removing servers is much more difficult and it is challenge to handle server failure.
graphic

3.6.2 Stateless architecture

in stateless architecture, requests from a user can be sent to any servers, each server will access state data from shared data store.
graphic

4 Global Multiple Data center

To handle a large number of user globally, we can build several data center around the world, and users are GeoDNS-routed to closest data center.

Several key technique to consider for global data center:

1) traffic redirection: effective tools are needed to direct user to the nearest data center.
2) Data synchronization: users from different GeoDNS could use different local databases or caches.
graphic

5 Traditional vs Cloud computing

5.1 Cost, Time, and Security

 

Cloud Computing

Traditional Computing

Security

Someone else’s  data centers

In-house data center

Upfont cost

Low upfront infrastructure investment

High Upfront costs

Scalability

Scalable quickly

Slow Scaling

Efficient

Efficient resource ultilization

Lower efficiency

Cost

Pay-as-you-go

Higher Cost

 

5.2 Management cost

graphic

6 Message queue

A message queue is queue of message sent between applications, it includes a sequence of work that waiting to be processed.
the basic architecture of message include an "input services", called producer/publishers, create messages, and the other services or server, called consumers/subscribers, connect to the queue and perform actions defined by the messages.
graphic

7 Logging, metric, automation

Logging: monitoring error logs to identify errors and problems in the system.

Metrics:  use metric to gain an insight about the health status of the system, for example:

   - Host level metrics: CPU, Memory, Disk I/O

   - Aggregated level metric: the performance of entire database tier, cache tier, etc.

   - Key business metric: daily active users, retention, revenue, etc.

Automation: When a system gets big and complex, we need to use automation tools to improve productivity. Continuous integration is good practice.

8 Summary

To handle 100 million users and beyond:

- max out the vertical scaling: using the fastest and best CPU/RAM/SSD disks/RAM Disks.

- keep web tier stateless.

- build redundancy at every tier.

- cache data as much as possible.

- Support multiple data center around the world.

- Host static assets in CDN.

- Scale you data tier by partitioning/sharding.

- split tiers into individual services.

- monitor your system and use automation tools.

9 Reference

https://systeminterview.com/scale-from-zero-to-millions-of-users.php
https://medium.com/edureka/sql-vs-nosql-db-5d9b69ace6ac
https://www.thorntech.com/sql-vs-nosql/
https://www.redswitches.com/blog/difference-between-horizontal-vertical-scaling/
https://www.nginx.com/resources/glossary/load-balancing/
http://www.ines-panker.com/2019/08/03/scaling-1-to-10-000-users.html
http://www.ines-panker.com/2019/08/10/scaling-100-000-to-500-M-users.html
http://supplychaininstitute.com/data-lake-vs-data-warehouse-key-differences/
https://www.zaloni.com/resources/blog/why-smart-companies-are-complementing-their-data-warehouses-with-data-lakes/
https://www.researchgate.net/figure/Comparison-of-Cloud-Computing-models-in-terms-of-software-components-separation-of_fig2_327700356