System Design Interview(未完待续)

System Design

Overview

System Design

How to Design A Unique ID Generator In Distributed Systems

1. Understand the problem and establish design scope

  • IDs must be unique.
  • IDs are numerical values only.
  • IDs fit into 64-bit.
  • IDs are ordered by date.
  • Ability to generate over 10,000 unique IDs per second.

2. Propose high-level design and get buy-in

Option Figure Pros Cons
Multi-master replication image-20220904181751977 This solves some scalability issues because IDs can scale with the number of database servers 1. Hard to scale with multiple data centers.
2. IDs do not go up with time across multiple servers. (并发性不高)
3. It does not scale well when a server is added or removed.(不支持动态扩容)
UUID image-20220904182304706 1. Generating UUID is simple. No coordination between servers is needed so there will not be any synchronization issues.
2. The system is easy to scale because each web server is responsible for generating IDs they consume. ID generator can easily scale with web servers.
1. IDs are 128 bits long, but our requirement is 64 bits.
2. IDs do not go up with time.
3. IDs could be non-numeric.
Ticket Server image-20220904182623987 1. Numeric IDs.
2. It is easy to implement, and it works for small to medium-scale applications.”
Single point of failure. Single ticket server means if the ticket server goes down, all systems that depend on it will face issues. To avoid a single point of failure, we can set up multiple ticket servers. However, this will introduce new challenges such as data synchronization.(不是很理解,为啥非得搞成中心化的,去中心化不就好了嘛,相当于是把UUID的ID gen组件解耦,创建一个去中心化的Ticket Server。)
Twitter snowflake approach image-20220904182947119
image-20220904183259750

3. Wrap up

  • Clock synchronization. In our design, we assume ID generation servers have the same clock. This assumption might not be true when a server is running on multiple cores. The same challenge exists in multi-machine scenarios. Solutions to clock synchronization are out of the scope of this book; however, it is important to understand the problem exists. Network Time Protocol is the most popular solution to this problem. For interested readers, refer to the reference material.(时钟问题)

  • Section length tuning. For example, fewer sequence numbers but more timestamp bits are effective for low concurrency and long-term applications. (位数的调整)

  • High availability. Since an ID generator is a mission-critical system, it must be highly available.(高可用性)

How to Design A Web Crawler

How to Design A Chat System

How to Design YouTube

Let us look at some impressive statistics, demographics, and fun facts of YouTube in 2020.

  • Total number of monthly active users: 2 billion.
  • Number of videos watched per day: 5 billion.
  • 73% of US adults use YouTube.
  • 50 million creators on YouTube.
  • YouTube’s Ad revenue was $15.1 billion for the full year 2019, up 36% from 2018.
  • YouTube is responsible for 37% of all mobile internet traffic.
  • YouTube is available in 80 different languages.

1. Understand the problem and establish design scope

  • Ability to upload videos fast
  • Smooth video streaming
  • Ability to change video quality
  • Low infrastructure cost
  • High availability, scalability, and reliability requirements
  • Clients supported: mobile apps, web browser, and smart TV
  • Reduce CDN costs in deep dive when cloud CDN serves a video

2. Propose high-level design and get buy-in

High-level design for the video uploading:

image-20220814114734986

  1. Videos are uploaded to the original storage.

  2. Transcoding servers fetch videos from the original storage and start transcoding.

  3. Once transcoding is complete, the following two steps are executed in parallel:

    1. Transcoded videos are sent to transcoded storage.
    2. Transcoded videos are distributed to CDN.
    3. Transcoding completion events are queued in the completion queue.
    4. Completion handler contains a bunch of workers that continuously pull event data from the queue.
    5. 3b.1.a. and 3b.1.b. Completion handler updates the metadata database and cache when video transcoding is complete.
  4. API servers inform the client that the video is successfully uploaded and is ready for streaming.

Whenever you watch a video on YouTube, it usually starts streaming immediately and you do not wait until the whole video is downloaded. Downloading means the whole video is copied to your device, while streaming means your device continuously receives video streams from remote source videos. When you watch streaming videos, your client loads a little bit of data at a time so you can watch videos immediately and continuously.

3. Design deep dive

Video transcoding architecture

image-20220814120325477

Processor:

  1. Video splitting. Video stream is split or further split into smaller Group of Pictures (GOP) alignment. GOP is a group/chunk of frames arranged in a specific order. Each chunk is an independently playable unit, usually a few seconds in length.
  2. Some old mobile devices or browsers might not support video splitting. Preprocessor split videos by GOP alignment for old clients.
  3. DAG generation.
  4. Cache data. The preprocessor is a cache for segmented videos. For better reliability, the preprocessor stores GOPs and metadata in temporary storage. If video encoding fails, the system could use persisted data for retry operations.

DAG scheduler:

  1. The DAG scheduler splits a DAG graph into stages of tasks and puts them in the task queue in the resource manager.

image-20220814121400424

Resource manager:

image-20220814121621547

The resource manager works as follows:

  • The task scheduler gets the highest priority task from the task queue.
  • The task scheduler gets the optimal task worker to run the task from the worker queue.
  • The task scheduler instructs the chosen task worker to run the task.
  • The task scheduler binds the task/worker info and puts it in the running queue.
  • The task scheduler removes the job from the running queue once the job is done.

Task workers:

Task workers run the tasks which are defined in the DAG. Different task workers may run different tasks

System optimizations

  1. Speed optimization: parallelize video uploading

image-20220814122254906

  1. Speed optimization: place upload centers close to users
  2. Speed optimization: parallelism everywhere

image-20220814122341145

  1. Safety optimization: pre-signed upload URL
  2. Safety optimization: protect your videos (DRM、AES、Visual watermarking)
  3. Cost-saving optimization:
    1. Only serve the most popular videos from CDN and other videos from our high capacity storage video servers.(CDN上保留热度比较高的)image-20220813172234429
    2. For less popular content, we may not need to store many encoded video versions. Short videos can be encoded on-demand. (热度一般的按需编码,没必要各种清晰度都需要)
    3. Some videos are popular only in certain regions. There is no need to distribute these videos to other regions. (热度比较高的按受欢迎的地区放置在CDN上)
    4. Build your own CDN like Netflix and partner with Internet Service Providers (ISPs). (自建CDN,和ISPs合作,拒绝中间商赚差价)

Error handling

two types of errors exist:

  • Recoverable error. For recoverable errors such as video segment fails to transcode, the general idea is to retry the operation a few times. If the task continues to fail and the system believes it is not recoverable, it returns a proper error code to the client. (可恢复错误,重试后还报错,抛错)
  • Non-recoverable error. For non-recoverable errors such as malformed video format, the system stops the running tasks associated with the video and returns the proper error code to the client.(不可恢复错误,抛错)

4. Wrap up

还有哪些可以扩展的点:

  1. Scale the API tier(API Server是无状态的,可以考虑一下水平扩展)
  2. Scale the database(数据库方面考虑分片和副本)
  3. Live streaming(考虑实时流和非实时流的差异)
  4. Video takedowns(视频审核,有些违规内容可以在上传视频阶段就发现并打回)

How to Design Google Drive

1. Understand the problem and establish design scope

functional requirements:

  • Add files. The easiest way to add a file is to drag and drop a file into Google drive.
  • Download files.
  • Sync files across multiple devices. When a file is added to one device, it is automatically synced to other devices.
  • See file revisions.
  • Share files with your friends, family, and coworkers
  • Send a notification when a file is edited, deleted, or shared with you.

non-functional requirements:

  • Reliability. Reliability is extremely important for a storage system. Data loss is unacceptable.

  • Fast sync speed. If file sync takes too much time, users will become impatient and abandon the product.

  • Bandwidth usage. If a product takes a lot of unnecessary network bandwidth, users will be unhappy, especially when they are on a mobile data plan.

  • Scalability. The system should be able to handle high volumes of traffic.

  • High availability. Users should still be able to use the system when some servers are offline, slowed down, or have unexpected network errors.

estimation:

  • Assume the application has 50 million signed up users and 10 million DAU.

  • Users get 10 GB free space.

  • Assume users upload 2 files per day. The average file size is 500 KB.

  • 1:1 read to write ratio.

  • Total space allocated: 50 million * 10 GB = 500 Petabyte

  • QPS for upload API: 10 million * 2 uploads / 24 hours / 3600 seconds = ~ 240

  • Peak QPS = QPS * 2 = 480

2. Propose high-level design and get buy-in

image-20220822225940709

  • User: A user uses the application either through a browser or mobile app.
  • Block servers: Block servers upload blocks to cloud storage. Block storage, referred to as block-level storage, is a technology to store data files on cloud-based environments. A file can be split into several blocks, each with a unique hash value, stored in our metadata database. Each block is treated as an independent object and stored in our storage system (S3). To reconstruct a file, blocks are joined in a particular order. As for the block size, we use Dropbox as a reference: it sets the maximal size of a block to 4MB
  • Cloud storage: A file is split into smaller blocks and stored in cloud storage.
  • Cold storage: Cold storage is a computer system designed for storing inactive data, meaning files are not accessed for a long time.
  • Load balancer: A load balancer evenly distributes requests among API servers.
  • API servers: These are responsible for almost everything other than the uploading flow. API servers are used for user authentication, managing user profile, updating file metadata, etc.
  • Metadata database: It stores metadata of users, files, blocks, versions, etc. Please note that files are stored in the cloud and the metadata database only contains metadata.
  • Metadata cache: Some of the metadata are cached for fast retrieval.
  • Notification service: It is a publisher/subscriber system that allows data to be transferred from notification service to clients as certain events happen. In our specific case, notification service notifies relevant clients when a file is added/edited/removed elsewhere so they can pull the latest changes.
  • Offline backup queue: If a client is offline and cannot pull the latest file changes, the offline backup queue stores the info so changes will be synced when the client is online.

3. Design deep dive

Block Server

image-20220904163830928

  • A file is split into smaller blocks.

  • Each block is compressed using compression algorithms.

  • To ensure security, each block is encrypted before it is sent to cloud storage.

  • Blocks are uploaded to the cloud storage.

Metadata database

image-20220904164017846

  • User: The user table contains basic information about the user such as username, email, profile photo, etc.

  • Device: Device table stores device info. Push_id is used for sending and receiving mobile push notifications. Please note a user can have multiple devices.

  • Namespace: A namespace is the root directory of a user.

  • File: File table stores everything related to the latest file.

  • File_version: It stores version history of a file. Existing rows are read-only to keep the integrity of the file revision history.

  • Block: It stores everything related to a file block. A file of any version can be reconstructed by joining all the blocks in the correct order.

Upload flow

image-20220904164942842

  • Add file metadata.
    1. Client 1 sends a request to add the metadata of the new file.
    2. Store the new file metadata in metadata DB and change the file upload status to “pending.”
    3. Notify the notification service that a new file is being added.
    4. The notification service notifies relevant clients (client 2) that a file is being uploaded.
  • Upload files to cloud storage.
    1. Client 1 uploads the content of the file to block servers.
    2. Block servers chunk the files into blocks, compress, encrypt the blocks, and upload them to cloud storage.
    3. Once the file is uploaded, cloud storage triggers upload completion callback. The request is sent to API servers.
    4. File status changed to “uploaded” in Metadata DB.
    5. Notify the notification service that a file status is changed to “uploaded.”
    6. The notification service notifies relevant clients (client 2) that a file is fully uploaded.

Download flow

image-20220904165327300

  1. Notification service informs client 2 that a file is changed somewhere else.
  2. Once client 2 knows that new updates are available, it sends a request to fetch metadata.
  3. API servers call metadata DB to fetch metadata of the changes.
  4. Metadata is returned to the API servers.
  5. Client 2 gets the metadata.
  6. Once the client receives the metadata, it sends requests to block servers to download blocks.
  7. Block servers first download blocks from cloud storage.
  8. Cloud storage returns blocks to the block servers.
  9. Client 2 downloads all the new blocks to reconstruct the file.

Notification service

  • Long polling. Dropbox uses long polling.
  • WebSocket. WebSocket provides a persistent connection between the client and the server. Communication is bi-directional.

Even though both options work well, we opt for long polling for the following two reasons:

  • Communication for notification service is not bi-directional. The server sends information about file changes to the client, but not vice versa. (单向的)
  • WebSocket is suited for real-time bi-directional communication such as a chat app. For Google Drive, notifications are sent infrequently with no burst of data.(数据交互频率不是很高)

Save storage space

  • De-duplicate data blocks. Eliminating redundant blocks at the account level is an easy way to save space. Two blocks are identical if they have the same hash value. (还不是很理解???)
  • Adopt an intelligent data backup strategy. Two optimization strategies can be applied:
    • Set a limit: We can set a limit for the number of versions to store. If the limit is reached, the oldest version will be replaced with the new version. (一份文件的副本数有限)
    • Keep valuable versions only: Some files might be edited frequently. For example, saving every edited version for a heavily modified document could mean the file is saved over 1000 times within a short period. To avoid unnecessary copies, we could limit the number of saved versions. We give more weight to recent versions. Experimentation is helpful to figure out the optimal number of versions to save.(保留较为重要的副本)
  • Moving infrequently used data to cold storage. Cold data is the data that has not been active for months or years. Cold storage like Amazon S3 glacier is much cheaper than S3. (冷热分离)

4. Wrap up

we can upload files directly to cloud storage from the client instead of going through block servers. The advantage of this approach is that it makes file upload faster because a file only needs to be transferred once to the cloud storage. In our design, a file is transferred to block servers first, and then to the cloud storage. However, the new approach has a few drawbacks:

  • First, the same chunking, compression, and encryption logic must be implemented on different platforms (iOS, Android, Web). It is error-prone and requires a lot of engineering effort. In our design, all those logics are implemented in a centralized place: block servers. (不同端自己上传文件,需要花费更多成本)
  • Second, as a client can easily be hacked or manipulated, implementing encrypting logic on the client side is not ideal.(客户端自己做操作,会有安全问题)

评论

Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now

×