system design(how to design tweet)

Catalog

  • Clarify the requirements
  • Capacity Estimation
  • System APIs
  • High-level System Design
  • Data Storage
  • Scalability

Step1: Clarify the requirements

Clarify requirements and goals of the system

  • Requirements
  • Traffic size (e.g. daily active users)

Nobody expects you to design a complete system in 30-40 minutes.

Discuss the functionality and align with the interviewer on which components to focus on.

Type1: Functional Requirement

  1. Tweet
    • a. Create
    • b. Delete
  2. Timeline/Feed
    • a. Home
    • b. User
  3. Follow a user
  4. Like a tweet
  5. Search tweets
    ...

Type2: Non-Functional Requirement

  • Consistency
    • Every read receives the most recent write or an error
    • Sacrifice: Eventual consistency
  • Availability
    • Every request receives a response, without the guarantee that it contains the most recent write
    • Scalable
      • Performance: low latency
  • Partition tolerance (Fault Tolerance)
    • The system continues to operate despite an arbitrary number of messages being dropped by the network between nodes

Step2: Capacity Estimation

Assumption:
- 200 million DAU, 100 million new tweets per day
- Each user visits the home timeline 5 times and other users' timelines 3 times per day
- Each timeline/page has 20 tweets
- Each tweet is 280 bytes, plus 30 bytes of metadata
- Per photo: 200KB; 20% of tweets have images
- Per video: 2MB; 10% of tweets have video; 30% of videos will be watched

Storage Estimate

  • Write size daily:
    • Text:
      • 100M new tweets*(280+30)bytes/tweet = 31GB/day
    • Image:
      • 4TB/day
    • Video:
      • 20TB/day
  • Total
    • 24TB/day
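
The storage figures above can be reproduced with a few lines of arithmetic. This is only a sketch of the estimate: the constant names are made up for illustration, the inputs are the assumptions listed under Step2, and decimal units (1 KB = 1,000 bytes) are assumed to keep the numbers round.

```python
# Back-of-the-envelope daily write volume from the assumptions above.
# Decimal units are an assumption: 1 KB = 1,000 bytes.
KB, MB, GB, TB = 10**3, 10**6, 10**9, 10**12

NEW_TWEETS_PER_DAY = 100_000_000
TWEET_BYTES = 280 + 30        # tweet text + metadata
PHOTO_BYTES = 200 * KB
VIDEO_BYTES = 2 * MB
PHOTO_PERCENT = 20            # % of tweets that carry an image
VIDEO_PERCENT = 10            # % of tweets that carry a video

# Integer arithmetic avoids floating-point noise in the results.
text_per_day = NEW_TWEETS_PER_DAY * TWEET_BYTES
image_per_day = NEW_TWEETS_PER_DAY * PHOTO_PERCENT // 100 * PHOTO_BYTES
video_per_day = NEW_TWEETS_PER_DAY * VIDEO_PERCENT // 100 * VIDEO_BYTES
total_per_day = text_per_day + image_per_day + video_per_day

print(f"text:  {text_per_day / GB:.0f} GB/day")
print(f"image: {image_per_day / TB:.0f} TB/day")
print(f"video: {video_per_day / TB:.0f} TB/day")
print(f"total: {total_per_day / TB:.0f} TB/day")
```

Text is negligible next to media, so the total is dominated by video and images.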

Bandwidth Estimate (Social Networking => read heavy)

Daily Read Tweets Volume:
- 200M * (5 home visit + 3 user visit) * 20 tweets/page = 32B tweets/day
Daily Read Bandwidth:

  • Text: 32B * 280bytes / 86400 = 100MB/s
  • Image: 14GB/s
  • Video: 20GB/s
  • Total: 35GB/s
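
The read-bandwidth estimate follows the same pattern. The sketch below uses hypothetical constant names and assumes that images and videos appear in read tweets at the same rates as in written tweets, and that only 30% of videos are actually played. The exact results land slightly above the rounded figures quoted above (video comes out near 22GB/s rather than 20GB/s), which is fine for an order-of-magnitude estimate.

```python
# Back-of-the-envelope daily read bandwidth from the assumptions above.
KB, MB, GB = 10**3, 10**6, 10**9   # decimal units, an assumption
SECONDS_PER_DAY = 86_400

DAU = 200_000_000
VISITS_PER_USER = 5 + 3            # home timeline + other users' timelines
TWEETS_PER_PAGE = 20

reads_per_day = DAU * VISITS_PER_USER * TWEETS_PER_PAGE

# Bytes per second for each content type.
text_bps = reads_per_day * 280 / SECONDS_PER_DAY
image_bps = reads_per_day * 20 // 100 * (200 * KB) / SECONDS_PER_DAY
# 10% of tweets have a video, and 30% of those videos are watched.
video_bps = reads_per_day * 10 // 100 * 30 // 100 * (2 * MB) / SECONDS_PER_DAY
total_bps = text_bps + image_bps + video_bps

print(f"reads: {reads_per_day / 10**9:.0f}B tweets/day")
print(f"text:  {text_bps / MB:.0f} MB/s")
print(f"image: {image_bps / GB:.1f} GB/s")
print(f"video: {video_bps / GB:.1f} GB/s")
print(f"total: {total_bps / GB:.1f} GB/s")
```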

Step3: System APIs

postTweet(userToken, string tweet)

deleteTweet(userToken, string tweetId)

likeOrUnlikeTweet(userToken, string tweetId, bool like)

readHomeTimeLine(userToken, int pageSize, opt string pageToken)

readUserTimeLine(userToken, int pageSize, opt string pageToken)
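
To make the pagination contract concrete, here is a toy in-memory version of some of these calls. It is a sketch under several assumptions that are not in the original notes: the user token is treated as the user id (a real service would validate it), `pageToken` is an opaque string that happens to encode an offset, and the `Page` type, `like_count` helper, and class name are hypothetical. The home timeline is left out because it depends on the follow graph.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Page:
    tweet_ids: list[str]
    next_page_token: Optional[str]  # None when there are no more results


class InMemoryTweetService:
    """Toy single-process stand-in for the API (illustration only)."""

    MAX_TWEET_LENGTH = 280  # assumed limit, matching the 280-byte estimate

    def __init__(self) -> None:
        self._next_id = count(1)
        self._author: dict[str, str] = {}        # tweet_id -> author
        self._text: dict[str, str] = {}          # tweet_id -> tweet text
        self._likes: dict[str, set[str]] = {}    # tweet_id -> users who liked
        self._user_timeline: dict[str, list[str]] = {}  # newest first

    def post_tweet(self, user_token: str, tweet: str) -> str:
        if len(tweet) > self.MAX_TWEET_LENGTH:
            raise ValueError("tweet too long")
        tweet_id = str(next(self._next_id))
        self._author[tweet_id] = user_token
        self._text[tweet_id] = tweet
        self._likes[tweet_id] = set()
        self._user_timeline.setdefault(user_token, []).insert(0, tweet_id)
        return tweet_id

    def delete_tweet(self, user_token: str, tweet_id: str) -> None:
        if self._author.get(tweet_id) != user_token:
            raise PermissionError("only the author can delete a tweet")
        del self._author[tweet_id], self._text[tweet_id], self._likes[tweet_id]
        self._user_timeline[user_token].remove(tweet_id)

    def like_or_unlike_tweet(self, user_token: str, tweet_id: str, like: bool) -> None:
        likers = self._likes[tweet_id]
        if like:
            likers.add(user_token)      # idempotent: liking twice counts once
        else:
            likers.discard(user_token)

    def like_count(self, tweet_id: str) -> int:
        return len(self._likes[tweet_id])

    def read_user_timeline(
        self, user_token: str, page_size: int, page_token: Optional[str] = None
    ) -> Page:
        start = int(page_token) if page_token else 0
        ids = self._user_timeline.get(user_token, [])
        chunk = ids[start:start + page_size]
        end = start + len(chunk)
        return Page(chunk, str(end) if end < len(ids) else None)
```

The caller passes `next_page_token` back unchanged to fetch the next page, and stops when it is `None`. An offset token is the simplest choice; a production design would more likely encode a tweet id or timestamp so that new tweets do not shift pages.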

Step4: High-Level System Design:

  • post tweets

  • user timeline(push/pull mode)

https://medium.com/@winapp/read-fast-with-fan-out-write-f25257117297

Home Timeline (cont'd)

Fan out on write

  • Inefficient for users with a huge number of followers (like Taylor Swift): every tweet must be written to each follower's timeline

Hybrid Solution

  • Non-hot users:

    • fan out on write(push)
  • Hot users:

    • fan out on read (pull): at timeline request time, read the hot users' tweets from the tweets cache and merge them with the precomputed results from non-hot users
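
The hybrid approach can be sketched in a few dozen lines. Everything here is illustrative: the class and method names are hypothetical, the "hot" threshold is set to 2 followers only so that a tiny demo can trigger it, and a simple counter stands in for tweet timestamps. A new follower also does not receive a non-hot author's older tweets in this sketch.

```python
from collections import defaultdict
from itertools import count

HOT_FOLLOWER_THRESHOLD = 2  # hypothetical; real systems would use a far larger value


class HybridTimeline:
    def __init__(self):
        self._clock = count(1)                      # stand-in for timestamps
        self.followers = defaultdict(set)           # author -> followers
        self.following = defaultdict(set)           # user -> authors followed
        self.tweets_by_author = defaultdict(list)   # the "tweets cache"
        self.home_cache = defaultdict(list)         # user -> pushed tweets

    def follow(self, follower, author):
        self.followers[author].add(follower)
        self.following[follower].add(author)

    def is_hot(self, author):
        return len(self.followers[author]) >= HOT_FOLLOWER_THRESHOLD

    def post(self, author, text):
        tweet = (next(self._clock), author, text)
        self.tweets_by_author[author].append(tweet)
        if not self.is_hot(author):
            # Push: write the tweet into every follower's home timeline.
            for follower in self.followers[author]:
                self.home_cache[follower].append(tweet)
        return tweet

    def home_timeline(self, user, limit=20):
        # Start from the pushed tweets, then pull from hot authors at read time.
        # A set removes duplicates if an author became hot after earlier pushes.
        merged = set(self.home_cache[user])
        for author in self.following[user]:
            if self.is_hot(author):
                merged.update(self.tweets_by_author[author])
        return sorted(merged, reverse=True)[:limit]  # newest first
```

Writes stay cheap for hot users (one append to the tweets cache), while reads pay a small merge cost proportional to the number of hot accounts a user follows.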

Step5: Data Storage

principles

  • SQL database:
    • e.g, user table
  • NoSQL database:
    • e.g, timelines
  • File system:
    • media file: image, audio, video

Step6: Scalability

  • Identify potential bottlenecks
  • Discuss solutions, focusing on tradeoffs
    • Data sharding
      • data store, cache
    • Load balancing
      • user <-> application server
      • application server <-> cache server
      • application server <-> db
    • Data caching
      • read heavy

Sharding

Why?

  • Impossible to store/process all data on a single machine

How?

  • Break large tables into smaller shards on multiple servers

Pros

  • Horizontal scaling

Cons

  • Complexity(distributed query, resharding...)

Option 1: Shard by tweets' creation time

Pros:

  • Limited shards to query

Cons:

  • Hot/Cold data issue
  • New shards fill up quickly

Option 2: Shard by hash(userId): store all of a user's data on a single shard

Pros:

  • Simple
  • Querying a user timeline is straightforward

Cons:

  • Home timeline still needs to query multiple shards
  • Non-uniform distribution of storage
  • Hot users
  • Availability

Option 3: Shard by hash(tweetId)

Pros:

  • uniform distribution
  • high availability

Cons:

  • Need to query all shards to generate a user/home timeline (mitigated by caching)
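
The difference between Option 2 and Option 3 shows up in how many shards a timeline query touches. The sketch below is only an illustration: the shard count, the function names, and the use of MD5 as a stable hash are all assumptions (Python's built-in `hash()` is randomized per process for strings, so it is not suitable for routing).

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key (user id or tweet id) to a shard with a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Option 2: shard by hash(userId).
def user_timeline_shards_by_user(user_id: str) -> set[int]:
    # All of the user's tweets live on one shard.
    return {shard_for(user_id)}


def home_timeline_shards_by_user(followee_ids: list[str]) -> set[int]:
    # One lookup per followed user, possibly on different shards.
    return {shard_for(user_id) for user_id in followee_ids}


# Option 3: shard by hash(tweetId).
def timeline_shards_by_tweet() -> set[int]:
    # A user's tweets are spread everywhere, so every shard must be asked
    # (scatter-gather) unless the timeline is served from a cache.
    return set(range(NUM_SHARDS))


print(user_timeline_shards_by_user("alice"))
print(home_timeline_shards_by_user(["bob", "carol", "dave"]))
print(timeline_shards_by_tweet())
```

Note that plain modulo routing forces most keys to move when `NUM_SHARDS` changes, which is one source of the resharding complexity listed under Cons.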

Caching

Why?

  • social networks have heavy read traffic
  • queries can be slow and costly

How?

  • store hot/precomputed data in memory so reads can be much faster

Timeline service

  • user timeline: user_id -> {tweet_id}
  • home timeline: user_id -> {tweet_id}
  • tweets: tweet_id -> tweet
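
A minimal sketch of the user_id -> {tweet_id} mapping as a bounded in-memory cache follows. The eviction policy is an assumption: the notes only list "caching policy" as a topic, and least-recently-used (LRU) is picked here as one common choice. The class name and size limits are hypothetical.

```python
from collections import OrderedDict, deque


class TimelineCache:
    """LRU cache of user_id -> most recent tweet ids (newest first)."""

    def __init__(self, max_users: int, max_tweets_per_user: int):
        self.max_users = max_users
        self.max_tweets_per_user = max_tweets_per_user
        self._data = OrderedDict()  # ordered from least to most recently used

    def push(self, user_id, tweet_id):
        timeline = self._data.get(user_id)
        if timeline is None:
            # A bounded deque silently drops the oldest tweet id when full.
            timeline = deque(maxlen=self.max_tweets_per_user)
            self._data[user_id] = timeline
        timeline.appendleft(tweet_id)
        self._data.move_to_end(user_id)        # mark as most recently used
        if len(self._data) > self.max_users:
            self._data.popitem(last=False)     # evict least recently used user

    def get(self, user_id):
        timeline = self._data.get(user_id)
        if timeline is None:
            return None                        # cache miss: fall back to the store
        self._data.move_to_end(user_id)
        return list(timeline)


cache = TimelineCache(max_users=2, max_tweets_per_user=3)
for i in range(5):
    cache.push("u1", f"t{i}")
print(cache.get("u1"))  # only the 3 newest ids are kept
```

The cache holds tweet ids rather than full tweets, so the separate tweet_id -> tweet mapping can be cached and sharded independently.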

Topics:

  • caching policy
  • sharding
  • performance

ref

https://www.youtube.com/watch?v=PMCdWr6ejpw&list=PLLuMmzMTgVK4RuSJjXUxjeUt3-vSyA1Or&index=1
