BackChannel
This section provides an overview of BackChannel.
Backchannel is an exclusive marketplace where only registered sellers and buyers can participate. This ensures a trusted environment where transactions are secure and participants are verified. Confidentiality is a priorityβsellers can choose to whom they show their inventory, ensuring that their participation remains private.
Non-Functional requirements
- High reliability.
- High availability with minimal latency.
- The system should be scalable and efficient.
Technology Stack
Blockquotes are useful for emphasizing key points or quoting external sources:
-
Programming Languages: Used for front and backend development, providing flexibility, ease of use, and scalability.
-
Frontend Framework: A modern frontend framework for building reactive, component-based user interfaces focusing on performance and simplicity.
-
Styling:: A utility-first CSS framework that provides pre-designed components and utility classes to build responsive and customizable UIs rapidly.
-
Mobile Framework: Flutter: A UI toolkit for building natively compiled applications for mobile, web, and desktop from a single codebase, providing fast development and expressive UIs.
-
Backend Framework: Nestjs: An open-source backend platform for building scalable and secure web and mobile applications, offering features such as authentication, database, file storage, and more.
-
Logging:
Winston: A versatile logging library for Node.js applications, providing support for various log levels, formats, and transports.
ELK Stack (Elasticsearch, Logstash, Kibana): A powerful stack for centralized logging and log analysis, offering features such as log storage, indexing, searching, and visualization.
- Monitoring:
Prometheus: An open-source monitoring and alerting toolkit designed for reliability and scalability, providing metrics collection, querying, and alerting capabilities.
Grafana: A platform for monitoring and observability, offering visualization, analytics, and alerting features for monitoring various aspects of the system.
- Infrastructure:
DigitalOcean: A cloud infrastructure provider offering scalable and reliable virtual servers (droplets) for hosting applications and services.
DigitalOcean Spaces S3 Object Storage: A scalable object storage service for storing and serving large amounts of unstructured data, such as images and media files.
CloudFlare: A global content delivery network (CDN) and security provider that improves website performance, security, and reliability by caching content and protecting against threats.
- Design Tools:
Figma: A collaborative interface design tool that allows designers to create, prototype, and collaborate on user interface designs in real time.
PenPot: A design and prototyping platform for creating interactive UI mockups and prototypes, providing tools for design collaboration and version control.
- Communication:
Discord: A messaging platform for teams to communicate and collaborate in channels, offering features such as chat, file sharing, and integration with other tools.
SendGrid: An email delivery and marketing platform that enables businesses to send transactional and marketing emails at scale, with features such as email templates, analytics, and deliverability optimization.
Twilio: A cloud communications platform that provides APIs for programmatically sending and receiving SMS messages, making phone calls, and more.
Estimation and Constraints
Let's start with the long-term estimation and constraints.
Traffic:
Let us assume we have 100 million daily active users (DAU) with 1 million buyers and on average our platform enables 10 million requests daily.
If on average each user performs 10 actions (such as searching for a product, joining an auction, etc.) we will have to handle 1 billion requests daily.
100 πππππππΓ10 πππ‘ππππ =1 πππππππ/πππ¦100 millionΓ10 actions=1 billion/day
What would be Requests Per Second (RPS) for our system?
1 billion requests per day translate into 12K requests per second.
1 πππππππ(24 βππ Γ3600 π ππππππ )=βΌ12πΎ ππππ’ππ π‘π /π πππππ(24 hrsΓ3600 seconds)1 billionβ=βΌ12Krequests/second
Storage:
If we assume each message on average is 400 bytes, we will require about 400 GB of database storage every day.
1 πππππππΓ400 ππ¦π‘ππ =βΌ400 πΊπ΅/πππ¦1 billionΓ400 bytes=βΌ400 GB/day
And for 10 years, we will require about 1.4 PB of storage.
400 πΊπ΅Γ10 π¦ππππ Γ365 πππ¦π =βΌ1.4 ππ΅400 GBΓ10 yearsΓ365 days=βΌ1.4 PB
Bandwidth:
As our system is handling 400 GB of ingress daily, we will require a minimum bandwidth of around 4 MB per second.
400 πΊπ΅(24 βππ Γ3600 π ππππππ )=βΌ5 ππ΅/π πππππ(24 hrsΓ3600 seconds)400 GBβ=βΌ5 MB/second
High-level estimate
Here is our high-level estimate:
- Daily active users (DAU): 100 million.
- Requests per second (RPS): 12K/s.
- Storage (per day): ~400 GB.
- Storage (10 years): ~1.4 PB.
- Bandwidth : ~5 MB/s.
What kind of database should we use?
While our data model seems quite relational, we don't necessarily need to store everything in a single database, as this can limit our scalability and quickly become a bottleneck.
We will split the data between different services, each owning a particular table. Then we can use a relational database such as PostgreSQL or a distributed NoSQL database such as MongoDB for our use case.
High-level design
Now let us do a high-level design of our system.
Architecture:
We will use microservices architecture since it will make it easier to scale and decouple our services horizontally. Each service will have ownership of its data model. Let's try to divide our system into some core services.
Customer Service:
This service handles customer-related concerns such as authentication and customer information.
Buyers Service:
This service handles buyer-related concerns such as authentication and buyer information.
Product Service:
This service will be responsible for product matching and quadtree aggregation.
Auction Service:
This service handles sale-related functionality in our system.
Payment Service:
This service will be responsible for handling payments in our system.
Notification Service:
This service will simply send push notifications to the users.
Analytics Service:
This service will be used for metrics and analytics use cases.
What about inter-service communication and service discovery:
Since our architecture is microservices-based, services will be communicating with each other as well. Generally, REST or HTTP performs well but we can further improve the performance using gRPC which is more lightweight and efficient.
Service discovery is another thing we will have to take into account. We can also use a service mesh that enables managed, observable, and secure communication between individual services.
Quadtrees:
A Quadtree is a tree data structure in which each internal node has exactly four children. They are often used to partition a two-dimensional space by recursively subdividing it into four quadrants or regions. Each child or leaf node stores spatial information. Quadtrees are the two-dimensional analog of Octrees which are used to partition three-dimensional space.
quadtree:
Quadtrees enable us to search points within a two-dimensional range efficiently, where those points are defined as latitude/longitude coordinates or as cartesian (x, y) coordinates.
We can save further computation by only subdividing a node after a certain threshold.
quadtree-subdivision:
Quadtree seems perfect for our use case, we can update the Quadtree every time we receive a new location update from the driver. To reduce the load on the quadtree servers we can use an in-memory datastore such as Redis to cache the latest updates. And with the application of mapping algorithms such as the Hilbert curve, we can perform efficient range queries to find nearby sellers for the customer.
How to find the best products nearby:
Once we have a list of nearby products from the Quadtree servers, we can perform some sort of ranking based on parameters like average ratings, relevance, past customer feedback, etc. This will allow us to broadcast notifications to the best available seller first.
Dealing with high demand:
In cases of high demand, we can use the concept of Surge Pricing. Surge pricing is a dynamic pricing method where prices are temporarily increased as a reaction to increased demand and mostly limited supply. This surge price can be added to the base price of the product.
Payments: Handling payments at scale is challenging, to simplify our system we can use a third-party payment processor like Stripe or PIX. Once the payment is complete, the payment processor will redirect the user back to our application and we can set up a webhook to capture all the payment-related data.
Notifications: Push notifications will be an integral part of our platform. We can use a message queue or a message broker such as Apache Kafka with the notification service to dispatch requests to Firebase Cloud Messaging (FCM) or Apple Push Notification Service (APNS) which will handle the delivery of the push notifications to user devices.
Detailed design: It's time to discuss our design decisions in detail.
Data Partitioning: To scale out our databases we will need to partition our data. Horizontal partitioning (aka Sharding) can be a good first step. We can shard our database either based on existing partition schemes or regions. If we divide the locations into regions using let's say zip codes, we can effectively store all the data in a given region on a fixed node. But this can still cause uneven data and load distribution, we can solve this using Consistent hashing.
Metrics and Analytics: Recording analytics and metrics is one of our extended requirements. We can capture the data from different services and run analytics on the data using Apache Spark which is an open-source unified analytics engine for large-scale data processing. Additionally, we can store critical metadata in the views table to increase data points within our data.
Caching: In a location services-based platform, caching is important. We have to be able to cache the recent locations of the customers and drivers for fast retrieval. We can use solutions like Redis or Memcached but what kind of cache eviction policy would best fit our needs?
Which cache eviction policy to use:
Least Recently Used (LRU) can be a good policy for our system. In this policy, we discard the least recently used key first.
How to handle cache miss:
Whenever there is a cache miss, our servers can hit the database directly and update the cache with the new entries.
Identify and resolve bottlenecks: BackChannel-advanced-design
Let us identify and resolve bottlenecks such as single points of failure in our design:
"What if one of our services crashes?"
"How will we distribute our traffic between our components?"
"How can we reduce the load on our database?"
"How to improve the availability of our cache?"
"How can we make our notification system more robust?"
To make our system more resilient we can do the following:
Running multiple instances of each of our services.
Introducing load balancers between clients, servers, databases, and cache servers.
Using multiple read replicas for our databases.
Multiple instances and replicas for our distributed cache.
Exactly once delivery and message ordering are challenging in a distributed system, we can use a dedicated message broker such as Apache Kafka or NATS to make our notification system more robust.