Notification system design is a complex topic that spans user experience, system architecture, and infrastructure engineering. There is much to think about: transactional versus marketing messages, multiple devices per user, regulatory compliance (GDPR, CAN-SPAM), and scaling to millions of notifications per day. A robust notification system must support multiple notification types—transactional, promotional, and system-generated alerts—while maintaining reliability, low latency, and user satisfaction.
This comprehensive guide covers notification system design from two critical perspectives: the user-facing experience and the technical architecture that powers it. Whether you're building a notification system from scratch, preparing for a system design interview, or evaluating build-versus-buy decisions, this article will help you understand the complete picture.
User Experience & Design Principles
Rethinking Email Notification System Design
Can anyone else relate to the overwhelming number of email notifications in their inbox?
That's because email notifications are still the default notification service in workplace tools.
Our team uses the collaborative design platform Figma, and every time a comment is posted there, I receive an email notification about it. But without enough context, it's not very meaningful or actionable. And that's just one of at least a half dozen work tools with which I interact daily.
Receiving notifications to my email inbox from every work app and tool never felt right to me, and I think that's because of the mindset that I have around email. When I'm in my inbox, I think more globally, and when I'm in a different workplace platform, I think more locally.
In other words, I need the right context to get the most out of my notifications. For example, I go into my inbox and see notifications from Figma, Notion, Google Docs, and Slack, and I have to reset my mind to think about those platforms specifically to really understand what the notification means to my workflow. When I'm IN Google Docs and I can see the comments alongside the text, it's less disorienting.
Another example: when several notifications arrive from the same document within a couple of minutes, there's a good chance you'll miss important updates and disengage. On the sending side, a dedicated bulk notification UI, integrated with a bulk notification service, lets senders select filters and compose a single message for a targeted group of users instead of firing off many separate alerts. This reduces notification overload and helps important updates reach the right audience, which improves overall engagement.
When MagicBell first brought me on to help design a notification system, I looked at what other systems were already out there. It was fascinating to see that even the most advanced notification systems seemed to be entirely geared around engagement, largely within apps like Facebook and LinkedIn. The in-app notification systems in many work and productivity tools were not that robust and relied mainly on email.
But then I understood why that's the case: Work tools are in the business of being work tools and they're not focused on efficient, customized notification system design as a primary feature.
As work activities and collaboration continue to spread across devices, locations, and time zones, a more robust notification system could make a significant difference to many of us. Here's how I began to orient myself around design and all its parts—visible and not.
The Bell & Notification Center Window
The most obvious building block of a web notification system is the bell icon (thanks to social apps), the popover modal, or even the full page where the notifications themselves live.
When looking at an individual notification, this is the short list of parts that we can expect it to have:
- Notification content
- A time stamp or time elapsed since it was received
- An icon or thumbnail to indicate who it's from (person image, brand logo, etc.)
- A visual status indication to immediately communicate whether it is unseen/seen, unread, or read
- A global action that can be taken on all notifications regardless of category: "set importance," "mark as new," "mark as read," "mute," "archive," "delete"

A notification bell and pop-up modal
We think a notification system for a social app and one for work tools are fundamentally different: in work tools, engagement is not the goal. Instead, a work software notification system should help the end user sort out the relevant information efficiently enough to take action, while ensuring that fewer items fall through the cracks.
In order to think about email and push notification system design in a more structured way—and to identify their relevant actions—we've categorized them as follows:
Notification Type Taxonomy
Incoming Request or Message Notification
When it comes to work, the majority of emails or any type of incoming message boils down to some type of a request.
This incoming message could be an email, an SMS alert, a message in Slack (or another messaging app), or a notification from social media. Each of these communications is processed as a notification request, which is then formatted into a notification message and delivered through channels such as email, SMS, or push.
These incoming communications can be accompanied by responses or a series of comments related to the request, which are not part of the message thread itself since the receiving party handles the request.
Notifications for Action Taken
When an action is taken on a communication thread, employees often need to surface it to all users who have a stake in that particular thread. Each category of notifications will have a particular action and some global workflow actions, like the ability to unsubscribe from further notifications on that thread.
System Notifications
Since workplace software and tools are an integral part of productivity and workflows, notifications around changes, improvements, or possible disruptions within this tool are incredibly important to the end user. However, only a few of them are relevant at the moment the user receives them. So, we've looked at ways that allow system notifications to be noticed while also giving the user maximum control (user preference) over when they see them.
Account Management Notifications
Account management notifications—administrator-driven changes—are relevant to the user account itself, as well as to other departments like billing. Similar to system notifications, many of these will be informative but will not require action, so they'll rarely get to the top of the notification pile.
Marketing Notifications
These messages are about product updates, upgrades, and what's available for the user as their needs grow and change. These are the least relevant to the users' daily workflows, and our designs reflect this idea. The degree to which they choose to interact with these types of notifications will be entirely up to the user.
Acting on Notifications
We've paid close attention to notification system design so that we're providing the maximum amount of information possible while also enabling the end users to decide whether or not to act quickly.
Here's a bit more about the post-notification actions as they relate to the above notification types:
Communication Notifications, Push Notifications & Actions
- View in context
- Quick reply
- Reminder ("at a scheduled time," "snooze," or even conditional logic such as "when X happens, do Y")
- Delegate
- Label (to organize)
- Set status (to indicate action or communicate changes to team members)
For example, long-form communication notification design would include a button that takes the user to the full notification itself, whereas a comment notification may allow the user to go straight into the chat to write back.
Additionally, every notification that has an action would benefit from the ability to set a reminder or snooze, set a status, add a label, or delegate to a team member.
System, Account Management, & Marketing
These types of notifications generally don't require any action and are therefore not as important to user workflow. However, when an action is required, these notifications should make their way to the top.
System Architecture Components
Building a production-ready notification system requires careful architectural planning. While the user-facing experience is critical, the underlying infrastructure must handle high volumes, ensure reliability, and scale seamlessly. Here's how the major components work together in a scalable notification system design.
Core Architecture Overview
A robust notification system consists of several interconnected layers:
- API Gateway / Notification Service - Entry point for all notification requests
- Message Queue System - Buffers requests and enables asynchronous processing
- Notification Processor - Handles business logic, user preferences, and routing
- Channel Services - Specialized delivery services for each notification channel
- External Delivery Providers - Third-party services (APNs, FCM, SMTP, Twilio)
- Storage & Tracking - Database for notification history and delivery status
- Retry & Dead Letter Queue - Handles failed deliveries
1. Notification Service (API Gateway)
The entry point for all notification requests. It validates incoming requests, authenticates API calls, and routes notifications to appropriate channels. Think of this as the orchestrator that decides whether a notification goes via email, push, SMS, or in-app channels.
Key Responsibilities:
- Request validation and authentication
- Rate limiting enforcement (prevent abuse)
- Routing logic based on notification type
- Initial idempotency checks (prevent duplicates)
Example API Request:
POST /api/notifications
{
  "user_id": "user_123",
  "notification_type": "transactional",
  "channels": ["push", "email"],
  "priority": "high",
  "content": {
    "title": "Payment Received",
    "body": "Your payment of $99 was processed successfully."
  },
  "metadata": {
    "transaction_id": "txn_456"
  }
}
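The validation and idempotency responsibilities listed above can be sketched as follows. This is a minimal in-memory illustration, not a definitive implementation: `validateRequest`, `IdempotencyStore`, the `KNOWN_CHANNELS` list, and the idea of using `metadata.transaction_id` as the deduplication key are all assumptions made for the example. A production service would more likely keep seen keys in a shared store with an expiry.

```typescript
// Hypothetical request shape, mirroring the example payload above.
interface NotificationRequest {
  user_id: string;
  notification_type: string;
  channels: string[];
  priority?: string;
  content: { title: string; body: string };
  metadata?: Record<string, string>;
}

// Assumed channel names; adjust to whatever your system supports.
const KNOWN_CHANNELS = ["push", "email", "sms", "in_app"];

// Returns a list of human-readable problems; an empty list means valid.
function validateRequest(req: Partial<NotificationRequest>): string[] {
  const errors: string[] = [];
  if (!req.user_id) errors.push("user_id is required");
  if (!req.notification_type) errors.push("notification_type is required");
  if (!req.channels || req.channels.length === 0) {
    errors.push("at least one channel is required");
  } else {
    for (const channel of req.channels) {
      if (KNOWN_CHANNELS.indexOf(channel) === -1) {
        errors.push(`unknown channel: ${channel}`);
      }
    }
  }
  if (!req.content || !req.content.title || !req.content.body) {
    errors.push("content.title and content.body are required");
  }
  return errors;
}

// In-memory idempotency store (assumption: single process, no expiry).
class IdempotencyStore {
  private seen = new Set<string>();

  // Returns true the first time a key is seen, false for duplicates.
  markIfNew(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```

A gateway handler would call `validateRequest` first, reject with a 4xx response if the list is non-empty, and only enqueue the notification when `markIfNew` returns true.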
2. Message Queue System
Critical for handling high volumes and ensuring reliability. Message queues decouple notification creation from notification delivery, allowing the system to handle traffic spikes gracefully.
Popular Choices:
- Apache Kafka: Best for high-throughput, event streaming (100k+ messages/sec)
- RabbitMQ: Great for complex routing logic and message prioritization
- AWS SQS: Managed solution with built-in retry mechanisms and dead letter queues
- Redis Streams: Lightweight option for moderate volumes, with consumer groups for distributing work across workers
Why Message Queues Are Essential:
- Asynchronous Processing: API responds immediately while notifications process in background
- Traffic Buffering: Handles traffic spikes without overwhelming downstream services
- Reliability: Messages persist until successfully processed (at-least-once delivery)
- Scalability: Easy to add more consumers to process messages faster
Queue Design Pattern:
API Gateway → [Priority Queue] → High Priority Processor
            → [Standard Queue] → Standard Processor
            → [Bulk Queue]     → Batch Processor
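The routing decision in the diagram above could look roughly like this. The function name `selectQueue`, the priority values, and the recipient threshold are illustrative assumptions; the article does not specify how requests are assigned to queues.

```typescript
type QueueName = "priority" | "standard" | "bulk";

// Hypothetical routing rule: urgent/high-priority traffic gets its own queue,
// large fan-outs go to the bulk queue, everything else is standard.
function selectQueue(priority: string, recipientCount: number): QueueName {
  if (priority === "urgent" || priority === "high") return "priority";
  if (recipientCount > 1000) return "bulk"; // threshold is an arbitrary assumption
  return "standard";
}
```

Keeping the three queues separate means a large marketing send cannot delay a password reset that arrives a moment later.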
3. Notification Processor (Business Logic Layer)
The brain of the notification system. This component:
- Fetches user notification preferences from the User Preferences Service
- Determines which channels to use (respect opt-outs, quiet hours, channel preferences)
- Applies notification templates and personalizes content
- Enforces rate limits per user and per channel
- Routes to appropriate Channel Services
User Preference Checks:
async function processNotification(notification) {
  const user = await userPreferenceService.getPreferences(notification.user_id);

  // Respect quiet hours (urgent notifications bypass them)
  if (isQuietHours(user.timezone, user.quiet_hours)) {
    if (notification.priority !== 'urgent') {
      await scheduleForLater(notification, user.quiet_hours.end);
      return;
    }
  }

  // Filter channels based on user opt-outs
  const allowedChannels = notification.channels.filter(channel =>
    user.enabled_channels.includes(channel)
  );
  if (allowedChannels.length === 0) {
    return; // User has opted out of every requested channel
  }

  // Check rate limits
  if (await isRateLimited(notification.user_id, allowedChannels)) {
    await queueForBatch(notification);
    return;
  }

  // Route to channel services
  for (const channel of allowedChannels) {
    await channelServices[channel].send(notification);
  }
}
4. Channel Services (Delivery Layer)
Specialized services for each notification channel. Each channel has unique requirements, delivery mechanisms, and failure modes.
Push Notification Service
Integrates with:
- APNs (Apple Push Notification Service): For iOS devices
- FCM (Firebase Cloud Messaging): For Android devices and web push
- Web Push Protocol: For browser notifications
Key Challenges:
- Device token management (tokens expire, users uninstall apps)
- Platform-specific payload formats
- Certificate/key rotation for APNs
- Handling silent vs. alert notifications
Email Service
Connects with SMTP providers or email APIs:
- SendGrid, Mailgun, AWS SES, Postmark
Key Considerations:
- HTML vs. plain text rendering
- Deliverability and spam score optimization
- Bounce and complaint handling
- Unsubscribe link compliance (CAN-SPAM, GDPR)
SMS Service
Routes through SMS gateways:
- Twilio, AWS SNS, MessageBird
Key Challenges:
- Character limits (160 GSM-7 characters per SMS segment, 70 if the message contains Unicode; longer messages are split into multiple billed segments)
- International delivery and carrier-specific issues
- Cost optimization (SMS is expensive at scale)
- Shortcode vs. long code vs. toll-free numbers
In-App Notification Service
Delivers real-time notifications within the application:
- WebSockets for real-time delivery
- Server-Sent Events (SSE) for one-way streaming
- Long polling as fallback
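For the SSE option, each notification is written to a long-lived HTTP response as a small text frame. The sketch below formats one such frame. `formatSseEvent` is a hypothetical helper; the `id`, `event`, and `data` field names and the blank-line terminator come from the event stream format itself, and the response is assumed to be served with the `text/event-stream` content type.

```typescript
// Formats one Server-Sent Events frame. The payload is serialized as
// single-line JSON, so one "data:" line is enough.
function formatSseEvent(id: string, eventName: string, data: object): string {
  const payload = JSON.stringify(data);
  // A frame ends with a blank line, hence the two trailing empty strings.
  return [`id: ${id}`, `event: ${eventName}`, `data: ${payload}`, "", ""].join("\n");
}
```

On the client, an `EventSource` listener for the same event name would receive the JSON string and can use the `id` to resume after a reconnect.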
5. User Preferences Service
Stores and retrieves user notification preferences. This is critical for respecting user choices and avoiding notification fatigue.
Stored Preferences:
- Enabled channels (email, push, SMS, in-app)
- Quiet hours and timezone
- Notification frequency limits (max per hour/day)
- Category subscriptions (marketing, transactional, social)
- Language and localization preferences
Database Schema Example:
CREATE TABLE user_preferences (
  user_id VARCHAR(255) PRIMARY KEY,
  enabled_channels JSONB DEFAULT '["email", "push"]',
  quiet_hours JSONB DEFAULT '{"start": "22:00", "end": "08:00"}',
  timezone VARCHAR(50) DEFAULT 'UTC',
  max_notifications_per_hour INTEGER DEFAULT 10,
  subscribed_categories JSONB DEFAULT '["transactional"]',
  locale VARCHAR(10) DEFAULT 'en-US',
  updated_at TIMESTAMP DEFAULT NOW()
);
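The default quiet-hours window in this schema (22:00 to 08:00) wraps past midnight, which is easy to get wrong. Here is one possible way to evaluate it against the stored timezone. The names `QuietHours`, `toMinutes`, `minutesOfDayInZone`, and `inQuietHours` are assumptions for this sketch, and it relies on the runtime having time zone data available (as recent Node.js versions do).

```typescript
// Hypothetical shape matching the quiet_hours JSON in the schema above.
interface QuietHours {
  start: string; // "HH:MM", 24-hour clock
  end: string;   // "HH:MM", 24-hour clock
}

function toMinutes(hhmm: string): number {
  const [hours, minutes] = hhmm.split(":").map(Number);
  return hours * 60 + minutes;
}

// Minutes since local midnight for `date` in the given IANA time zone.
function minutesOfDayInZone(date: Date, timeZone: string): number {
  const hhmm = date.toLocaleTimeString("en-GB", {
    timeZone,
    hour12: false,
    hour: "2-digit",
    minute: "2-digit",
  });
  // Some runtimes print midnight as "24:xx"; the modulo normalizes that.
  return toMinutes(hhmm) % (24 * 60);
}

// True when `date` falls inside the quiet window, including windows
// that wrap past midnight (start later than end).
function inQuietHours(date: Date, timeZone: string, quiet: QuietHours): boolean {
  const now = minutesOfDayInZone(date, timeZone);
  const start = toMinutes(quiet.start);
  const end = toMinutes(quiet.end);
  if (start === end) return false; // treat as an empty window
  return start < end
    ? now >= start && now < end
    : now >= start || now < end;
}
```

The window is treated as start-inclusive and end-exclusive, so a notification at exactly 08:00 local time is delivered.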
6. Notification Tracker & Analytics
Logs delivery status and provides visibility into notification effectiveness.
Tracked Metrics:
- Sent: Notification successfully handed off to delivery provider
- Delivered: Confirmation from provider (push delivered to device, email accepted by recipient server)
- Opened: User opened push notification or email
- Clicked: User clicked link in notification
- Failed: Delivery failed (invalid device token, bounced email, etc.)
Database Schema Example:
CREATE TABLE notification_logs (
  id SERIAL PRIMARY KEY,
  notification_id VARCHAR(255) NOT NULL,
  user_id VARCHAR(255) NOT NULL,
  channel VARCHAR(50) NOT NULL,
  status VARCHAR(50) NOT NULL,
  sent_at TIMESTAMP,
  delivered_at TIMESTAMP,
  opened_at TIMESTAMP,
  clicked_at TIMESTAMP,
  failed_at TIMESTAMP,
  failure_reason TEXT,
  metadata JSONB,
  -- One row per notification per channel, so uniqueness spans both columns
  UNIQUE (notification_id, channel)
);
CREATE INDEX idx_user_status ON notification_logs(user_id, status);
CREATE INDEX idx_sent_at ON notification_logs(sent_at DESC);
7. Retry Mechanism & Dead Letter Queue
Handles failed deliveries with intelligent retry logic.
Retry Strategy (Backoff with Increasing Delays):
- 1st retry: After 1 minute
- 2nd retry: After 5 minutes
- 3rd retry: After 30 minutes
- 4th retry: After 2 hours
- After the 4th retry fails → Move to Dead Letter Queue
Dead Letter Queue (DLQ):
Failed notifications that exceed retry limits are sent to a DLQ for:
- Manual investigation
- Alerting operations team
- Identifying systemic issues (e.g., all emails to a domain are bouncing)
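The retry schedule and dead letter hand-off described above can be expressed as a small lookup. This is a sketch: `RETRY_DELAYS_MS`, `RetryDecision`, and `nextRetry` are names invented for the example, and the delays are simply the ones listed in the retry strategy.

```typescript
// Delays from the retry strategy above: 1 min, 5 min, 30 min, 2 hours.
const RETRY_DELAYS_MS = [
  1 * 60 * 1000,
  5 * 60 * 1000,
  30 * 60 * 1000,
  2 * 60 * 60 * 1000,
];

type RetryDecision =
  | { action: "retry"; delayMs: number }
  | { action: "dead_letter" };

// `retriesSoFar` counts retries that have already failed
// (0 means only the initial delivery attempt has failed).
function nextRetry(retriesSoFar: number): RetryDecision {
  if (retriesSoFar >= 0 && retriesSoFar < RETRY_DELAYS_MS.length) {
    return { action: "retry", delayMs: RETRY_DELAYS_MS[retriesSoFar] };
  }
  return { action: "dead_letter" };
}
```

A worker would call `nextRetry` after each failure and either re-enqueue the message with the returned delay or publish it to the DLQ.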
8. Notification Template Repository
Stores pre-defined templates for different notification types, allowing teams to maintain consistency and quickly customize content.
Template Structure:
{
  "template_id": "payment_received",
  "channels": {
    "email": {
      "subject": "Payment Received - {{amount}}",
      "html_body": "<html>...</html>",
      "text_body": "Your payment of {{amount}} was processed..."
    },
    "push": {
      "title": "Payment Received",
      "body": "Your payment of {{amount}} was successful."
    },
    "sms": {
      "body": "Payment received: {{amount}}. Transaction ID: {{transaction_id}}"
    }
  },
  "variables": ["amount", "transaction_id", "timestamp"]
}
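Filling in the `{{variable}}` placeholders from a template like the one above might look like this. `renderTemplate` is a hypothetical helper, and failing loudly on a missing variable is a design choice made for the sketch rather than something the template format requires.

```typescript
// Replaces {{name}} placeholders with values from `variables`.
// Throws if the template references a variable that was not supplied.
function renderTemplate(template: string, variables: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match: string, key: string) => {
    if (!Object.prototype.hasOwnProperty.call(variables, key)) {
      throw new Error(`Missing template variable: ${key}`);
    }
    return variables[key];
  });
}
```

Values inserted into an HTML body should be escaped before rendering; that step is omitted here.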
9. Scheduled Notifications
Manages notifications that should be delivered at a specific time. Use cases include:
- Reminders (meeting in 15 minutes)
- Time-zone-aware notifications (send at 9 AM user's local time)
- Scheduled campaigns (product launch announcement)
- Digest emails (daily summary of activity)
Implementation:
- Option 1: Database polling (check every minute for due notifications)
- Option 2: Delayed message queues (Redis with ZADD + score as timestamp)
- Option 3: Cron-based scheduler (Kubernetes CronJobs, AWS EventBridge)
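Option 2 boils down to keeping pending notifications ordered by due time and periodically removing everything that is due. The in-memory sketch below shows that idea without Redis; `ScheduledItem`, `DelayedQueue`, `schedule`, and `popDue` are names made up for the example. With a Redis sorted set, the timestamp would be the score and a range query by score would play the role of `popDue`.

```typescript
interface ScheduledItem {
  id: string;
  dueAt: number; // epoch milliseconds
}

class DelayedQueue {
  private items: ScheduledItem[] = [];

  // Inserts the item while keeping the list ordered by due time.
  schedule(item: ScheduledItem): void {
    const index = this.items.findIndex((other) => other.dueAt > item.dueAt);
    if (index === -1) {
      this.items.push(item);
    } else {
      this.items.splice(index, 0, item);
    }
  }

  // Removes and returns every item whose due time is at or before `now`.
  popDue(now: number): ScheduledItem[] {
    const firstPending = this.items.findIndex((item) => item.dueAt > now);
    const count = firstPending === -1 ? this.items.length : firstPending;
    return this.items.splice(0, count);
  }
}
```

A scheduler loop would call `popDue(Date.now())` on an interval and hand the returned items to the normal notification queue.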
How Components Work Together
In a typical notification flow:
- Business service (e.g., payment processor) calls Notification API
- API Gateway validates request, assigns unique notification ID, returns 202 Accepted
- Message Queue receives notification event
- Notification Processor consumes from queue:
  - Fetches user preferences
  - Checks rate limits
  - Applies notification template
  - Determines delivery channels
- Channel Services send to external providers (APNs, FCM, SendGrid, Twilio)
- Notification Tracker logs delivery status
- Retry Mechanism handles failures
- Analytics Dashboard displays real-time delivery metrics
For teams building notification systems, this architecture provides reliability and scale. However, building and maintaining this infrastructure is complex and time-consuming. MagicBell's notification infrastructure handles all of these components out-of-the-box, letting you focus on your core product instead of notification plumbing.
Essential Design Patterns
Successful notification systems leverage proven design patterns to achieve modularity, maintainability, and scalability. Here are the core patterns every notification system should implement.
1. Observer Pattern (Publish/Subscribe)
Perfect for event-driven notifications. When an event occurs (comment posted, payment processed, deployment completed), interested subscribers receive notifications automatically.
How It Works:
- Publishers emit events without knowing who will consume them
- Subscribers register interest in specific event types
- Event Bus routes events to appropriate subscribers
Example: GitHub Comment Notifications
// Event Publisher
class CommentService {
  async createComment(issueId, userId, content) {
    const comment = await db.comments.create({issueId, userId, content});

    // Publish event - doesn't know who's listening
    eventBus.publish('comment.created', {
      issueId,
      commentId: comment.id,
      authorId: userId,
      content
    });

    return comment;
  }
}

// Event Subscribers
eventBus.subscribe('comment.created', async (event) => {
  // Notify issue author
  const issue = await db.issues.findById(event.issueId);
  await notificationService.send({
    userId: issue.authorId,
    type: 'comment_on_your_issue',
    data: event
  });
});

eventBus.subscribe('comment.created', async (event) => {
  // Notify @mentioned users
  const mentions = extractMentions(event.content);
  for (const userId of mentions) {
    await notificationService.send({
      userId,
      type: 'mentioned_in_comment',
      data: event
    });
  }
});

eventBus.subscribe('comment.created', async (event) => {
  // Notify thread participants
  const participants = await db.comments
    .where({issueId: event.issueId})
    .distinct('userId');

  for (const userId of participants) {
    if (userId !== event.authorId) {
      await notificationService.send({
        userId,
        type: 'activity_on_subscribed_thread',
        data: event
      });
    }
  }
});
Benefits:
- Decoupling: Publishers don't need to know about notification logic
- Extensibility: Easy to add new subscribers without modifying publishers
- Testability: Each subscriber can be tested independently
2. Factory Method Pattern
Enables creating different notification types dynamically based on context. Essential for systems that support multiple notification categories, each with different formatting and delivery requirements.
Example: Multi-Channel Notification Factory
interface Notification {
  send(): Promise<void>;
}

interface NotificationFactory {
  createNotification(type: string, data: any): Notification;
}

class PushNotification implements Notification {
  constructor(private userId: string, private title: string, private body: string) {}

  async send() {
    const deviceTokens = await getDeviceTokens(this.userId);
    // sendEachForMulticast replaces the deprecated sendMulticast in recent firebase-admin releases
    await fcm.sendEachForMulticast({
      tokens: deviceTokens,
      notification: {
        title: this.title,
        body: this.body
      }
    });
  }
}

class EmailNotification implements Notification {
  constructor(private userId: string, private subject: string, private htmlBody: string) {}

  async send() {
    const user = await getUser(this.userId);
    await emailService.send({
      to: user.email,
      subject: this.subject,
      html: this.htmlBody
    });
  }
}

class SMSNotification implements Notification {
  constructor(private userId: string, private message: string) {}

  async send() {
    const user = await getUser(this.userId);
    await twilioClient.messages.create({
      to: user.phoneNumber,
      // Twilio needs a sender (from or messagingServiceSid); the env var name is an assumption
      from: process.env.TWILIO_FROM_NUMBER,
      body: this.message
    });
  }
}

class NotificationFactoryImpl implements NotificationFactory {
  createNotification(type: string, data: any): Notification {
    switch(type) {
      case 'push':
        return new PushNotification(data.userId, data.title, data.body);
      case 'email':
        return new EmailNotification(data.userId, data.subject, data.htmlBody);
      case 'sms':
        return new SMSNotification(data.userId, data.message);
      default:
        throw new Error(`Unknown notification type: ${type}`);
    }
  }
}

// Usage
const factory = new NotificationFactoryImpl();
const notification = factory.createNotification('push', {
  userId: 'user_123',
  title: 'New Message',
  body: 'You have a new message from Alice'
});
await notification.send();
Benefits:
- Single Responsibility: Each notification class handles one channel
- Open/Closed Principle: Easy to add new notification types without modifying existing code
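- Type Safety: Each notification class declares exactly the fields its channel needs; note that the factory's data: any parameter bypasses compile-time checks, so a discriminated union per type is a stricter alternative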
3. Chain of Responsibility Pattern
Routes notifications through a chain of handlers, each deciding whether to process, modify, or pass to the next handler. Particularly useful for implementing filters, validation, and priority-based delivery.
Example: Notification Processing Pipeline
interface NotificationHandler {
  setNext(handler: NotificationHandler): NotificationHandler;
  handle(notification: Notification): Promise<boolean>;
}

abstract class AbstractNotificationHandler implements NotificationHandler {
  private nextHandler: NotificationHandler | null = null;

  setNext(handler: NotificationHandler): NotificationHandler {
    this.nextHandler = handler;
    return handler;
  }

  async handle(notification: Notification): Promise<boolean> {
    if (this.nextHandler) {
      return this.nextHandler.handle(notification);
    }
    return true;
  }
}

class UserPreferenceHandler extends AbstractNotificationHandler {
  async handle(notification: Notification): Promise<boolean> {
    const prefs = await getUserPreferences(notification.userId);

    // Filter out disabled channels
    notification.channels = notification.channels.filter(channel =>
      prefs.enabledChannels.includes(channel)
    );

    if (notification.channels.length === 0) {
      console.log('All channels disabled for user');
      return false; // Stop processing
    }

    return super.handle(notification);
  }
}

class QuietHoursHandler extends AbstractNotificationHandler {
  async handle(notification: Notification): Promise<boolean> {
    if (notification.priority === 'urgent') {
      return super.handle(notification); // Skip quiet hours for urgent
    }

    const prefs = await getUserPreferences(notification.userId);
    const userTime = getCurrentTimeInTimezone(prefs.timezone);

    if (isWithinQuietHours(userTime, prefs.quietHours)) {
      await scheduleForLater(notification, prefs.quietHours.end);
      return false; // Stop processing, scheduled for later
    }

    return super.handle(notification);
  }
}

class RateLimitHandler extends AbstractNotificationHandler {
  async handle(notification: Notification): Promise<boolean> {
    const count = await getNotificationCount(
      notification.userId,
      Date.now() - 3600000 // Last hour
    );
    const prefs = await getUserPreferences(notification.userId);

    if (count >= prefs.maxNotificationsPerHour) {
      if (notification.priority === 'urgent') {
        return super.handle(notification); // Bypass rate limit for urgent
      }
      await queueForBatch(notification);
      return false; // Stop processing, will send in batch
    }

    return super.handle(notification);
  }
}

class DeliveryHandler extends AbstractNotificationHandler {
  async handle(notification: Notification): Promise<boolean> {
    for (const channel of notification.channels) {
      await channelService[channel].send(notification);
    }
    return super.handle(notification);
  }
}

// Build the chain
const chain = new UserPreferenceHandler();
chain
  .setNext(new QuietHoursHandler())
  .setNext(new RateLimitHandler())
  .setNext(new DeliveryHandler());

// Process notification through chain
await chain.handle(notification);
Benefits:
- Modularity: Each handler has a single responsibility
- Flexibility: Easy to add, remove, or reorder handlers
- Testability: Each handler can be tested in isolation
4. Strategy Pattern
Allows switching notification delivery strategies at runtime based on context. Essential for implementing different delivery modes like immediate, batched, or scheduled.
Example: Delivery Strategy Pattern
interface DeliveryStrategy {
  deliver(notification: Notification): Promise<void>;
}

class ImmediateDeliveryStrategy implements DeliveryStrategy {
  async deliver(notification: Notification): Promise<void> {
    // Send immediately
    for (const channel of notification.channels) {
      await channelService[channel].send(notification);
    }
  }
}

class BatchedDeliveryStrategy implements DeliveryStrategy {
  async deliver(notification: Notification): Promise<void> {
    // Add to batch queue
    await batchQueue.add(notification);
    // Batch processor runs every 15 minutes
    // and sends digest emails/notifications
  }
}

class ScheduledDeliveryStrategy implements DeliveryStrategy {
  constructor(private deliveryTime: Date) {}

  async deliver(notification: Notification): Promise<void> {
    await scheduleQueue.add(notification, {
      // Clamp to zero so a delivery time in the past is sent right away
      delay: Math.max(0, this.deliveryTime.getTime() - Date.now())
    });
  }
}

class TimeZoneAwareDeliveryStrategy implements DeliveryStrategy {
  constructor(private targetHour: number) {}

  async deliver(notification: Notification): Promise<void> {
    const user = await getUser(notification.userId);
    const deliveryTime = getNextOccurrenceOfHour(this.targetHour, user.timezone);
    await new ScheduledDeliveryStrategy(deliveryTime).deliver(notification);
  }
}

class NotificationService {
  async send(notification: Notification) {
    let strategy: DeliveryStrategy;

    // Choose strategy based on notification type and priority
    if (notification.priority === 'urgent') {
      strategy = new ImmediateDeliveryStrategy();
    } else if (notification.category === 'digest') {
      strategy = new BatchedDeliveryStrategy();
    } else if (notification.deliveryTime) {
      strategy = new ScheduledDeliveryStrategy(notification.deliveryTime);
    } else if (notification.category === 'marketing') {
      strategy = new TimeZoneAwareDeliveryStrategy(9); // 9 AM user's time
    } else {
      strategy = new ImmediateDeliveryStrategy();
    }

    await strategy.deliver(notification);
  }
}
Benefits:
- Flexibility: Switch delivery behavior at runtime
- Maintainability: Each strategy is independent and easy to modify
- Extensibility: Add new strategies without modifying existing code
Building these patterns from scratch requires significant engineering effort. MagicBell's platform implements these patterns out-of-the-box with a clean API, letting you send notifications with a single API call while benefiting from battle-tested architecture.
Scalability & Performance
Scaling a notification system to handle millions of notifications per day requires careful planning across multiple dimensions: infrastructure, database design, rate limiting, and monitoring.
Handling High Throughput
Scale Requirements:
- 10 million push notifications per day = ~116 notifications/second average (10,000,000 ÷ 86,400), ~500/sec assumed peak
- 5 million emails per day = ~58 emails/second average, ~250/sec assumed peak
- 1 million SMS per day = ~12 SMS/second average, ~50/sec assumed peak
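The averages above come from dividing the daily volume by the 86,400 seconds in a day, which gives about 115.7, 57.9, and 11.6 per second respectively. The peak figures imply a multiplier of a little over 4x; that multiplier is an assumption you should replace with one measured from your own traffic. The helper names below are invented for this sketch.

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60; // 86,400

// Average sustained rate for a given daily volume.
function averagePerSecond(perDay: number): number {
  return perDay / SECONDS_PER_DAY;
}

// Peak estimate: `peakFactor` is an assumed ratio of peak to average load.
function peakPerSecond(perDay: number, peakFactor: number): number {
  return averagePerSecond(perDay) * peakFactor;
}
```

Capacity planning should target the peak estimate rather than the average, since queues only absorb bursts for as long as consumers can eventually catch up.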
Scalability Strategies:
1. Horizontal Scaling (Stateless Services)
All notification services should be stateless, storing state in external systems:
- Redis: User preferences cache, rate limit counters, device tokens
- PostgreSQL/MySQL: Notification history, delivery status, user data
- Message Queue: Pending notifications, retry queue
This allows you to scale by simply adding more instances:
# Kubernetes deployment example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: notification-processor
spec:
  replicas: 10 # Scale to 10 instances
  selector: # Required in apps/v1; must match the pod template labels
    matchLabels:
      app: notification-processor
  template:
    metadata:
      labels:
        app: notification-processor
    spec:
      containers:
        - name: processor
          image: notification-processor:latest
          env:
            - name: REDIS_URL
              value: redis://cache:6379
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
2. Load Balancing
Distribute traffic across multiple instances:
- API Gateway: NGINX, AWS ALB, Cloudflare
- Message Queue Consumers: Multiple workers consuming from same queue
- Database: Read replicas for queries, primary for writes
3. Partitioning Strategies
User-Based Sharding:
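// Route notifications to specific queue based on user ID
function getQueueForUser(userId) {
  const hash = hashCode(userId); // May be negative for a Java-style string hash
  // Normalize so the index always falls in [0, NUM_QUEUES)
  const queueIndex = ((hash % NUM_QUEUES) + NUM_QUEUES) % NUM_QUEUES;
  return `notifications_queue_${queueIndex}`;
}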
Channel-Based Isolation:
[Push Queue] → Push Processor (high priority, fast)
[Email Queue] → Email Processor (medium priority)
[SMS Queue] → SMS Processor (low volume, expensive)
Geographic Distribution:
- US East queue → US data center
- EU queue → EU data center (supports GDPR data-residency requirements)
- APAC queue → Asia data center
Rate Limiting Strategies
Essential for preventing notification fatigue and protecting external services from overload.
Per-User Rate Limits
Sliding Window Log (Redis):
async function checkUserRateLimit(userId, maxPerHour = 10) {
  const key = `rate_limit:user:${userId}`;
  const now = Date.now();
  const hourAgo = now - 3600000;

  // Note: these steps are not atomic. Under concurrent requests, wrap them
  // in a MULTI/EXEC transaction or a Lua script to avoid over-admitting.

  // Remove entries older than one hour
  await redis.zremrangebyscore(key, 0, hourAgo);

  // Count recent notifications
  const count = await redis.zcard(key);
  if (count >= maxPerHour) {
    return false; // Rate limited
  }

  // Add current notification (random suffix keeps members unique)
  await redis.zadd(key, now, `${now}-${Math.random()}`);
  await redis.expire(key, 3600);
  return true; // Allowed
}
Token Bucket Algorithm:
Allows bursts while maintaining average rate.
// In-memory bucket: state is per process. With multiple stateless
// instances, keep the token count in a shared store such as Redis.
class TokenBucket {
  constructor(capacity, refillRate) {
    this.capacity = capacity; // Max tokens
    this.tokens = capacity;
    this.refillRate = refillRate; // Tokens per second
    this.lastRefill = Date.now();
  }

  async consume(tokens = 1) {
    this.refill();
    if (this.tokens >= tokens) {
      this.tokens -= tokens;
      return true; // Allowed
    }
    return false; // Rate limited
  }

  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    const tokensToAdd = elapsed * this.refillRate;
    this.tokens = Math.min(this.capacity, this.tokens + tokensToAdd);
    this.lastRefill = now;
  }
}
Per-System Rate Limits
Protect external services from overload:
- APNs: Apple does not publish a fixed per-connection rate; reuse and pool HTTP/2 connections, and back off when APNs returns 429 (TooManyRequests)
- FCM: No hard limit, but implement exponential backoff on errors
- SendGrid: Varies by plan (100-3000 emails/second)
- Twilio: Varies by account (1-100 SMS/second)
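When a provider signals overload, a common response is exponential backoff with jitter, so that many workers do not retry in lockstep. The sketch below uses the "full jitter" variant. `backoffDelayMs`, its default base and cap, and the injectable `random` parameter (there to make the function testable) are all assumptions for illustration.

```typescript
// Full-jitter exponential backoff: picks a random delay in
// [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 60000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

For example, with the defaults the upper bound grows from 1 second on the first retry to 8 seconds on the fourth, and never exceeds 60 seconds.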
Circuit Breaker Pattern:
class CircuitBreaker {
  constructor(threshold, timeout) {
    this.failureCount = 0;
    this.threshold = threshold; // Open circuit after N failures
    this.timeout = timeout; // Try again after X ms
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = Date.now();
  }

  async execute(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN');
      }
      this.state = 'HALF_OPEN';
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    if (this.failureCount >= this.threshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.timeout;
    }
  }
}

// Usage
const apnsCircuitBreaker = new CircuitBreaker(5, 60000); // Open after 5 failures, retry after 1 min

try {
  await apnsCircuitBreaker.execute(() => apns.send(notification));
} catch (error) {
  // Fallback: Queue for retry or use alternative provider
  await retryQueue.add(notification);
}
Database Design for Scale
Notification Storage Strategy:
Hot Storage (Redis):
- Recent notifications (last 24 hours)
- Unread notifications
- Fast retrieval for in-app notification feed
Warm Storage (PostgreSQL):
- Last 90 days of notifications
- Indexed for user queries
- Full search capabilities
Cold Storage (S3/Glacier):
- Historical data (90+ days)
- Compliance and audit trails
- Compressed and archived
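A minimal sketch of the tiering rule, using the 24-hour and 90-day thresholds from the lists above. It considers age only; keeping unread notifications in hot storage regardless of age is left out for brevity, and `storageTierFor` is a hypothetical helper name.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Maps a notification's age to a storage tier.
// The tier names are labels for this sketch, not identifiers from a real system.
function storageTierFor(createdAt, now = new Date()) {
  const ageMs = now.getTime() - createdAt.getTime();
  if (ageMs < DAY_MS) return 'hot';        // Redis
  if (ageMs < 90 * DAY_MS) return 'warm';  // PostgreSQL
  return 'cold';                           // S3/Glacier
}
```

A scheduled job could apply this function to decide which rows to evict from the cache or move to the archive.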
Optimized Database Schema:
-- Partition by created_at for efficient queries
CREATE TABLE notifications (
id BIGSERIAL,
user_id VARCHAR(255) NOT NULL,
notification_type VARCHAR(50) NOT NULL,
channels JSONB NOT NULL,
content JSONB NOT NULL,
read_at TIMESTAMP,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);
-- Create monthly partitions
CREATE TABLE notifications_2025_11 PARTITION OF notifications
FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');
CREATE TABLE notifications_2025_12 PARTITION OF notifications
FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');
-- Indexes for common queries
CREATE INDEX idx_user_created ON notifications(user_id, created_at DESC);
CREATE INDEX idx_user_unread ON notifications(user_id, created_at DESC)
WHERE read_at IS NULL;
-- Delivery status tracking (separate table for better performance)
CREATE TABLE notification_delivery_status (
id BIGSERIAL PRIMARY KEY,
notification_id BIGINT NOT NULL,
channel VARCHAR(50) NOT NULL,
status VARCHAR(50) NOT NULL, -- sent, delivered, opened, clicked, failed
provider_id VARCHAR(255), -- External provider's tracking ID
delivered_at TIMESTAMP,
opened_at TIMESTAMP,
clicked_at TIMESTAMP,
failed_at TIMESTAMP,
failure_reason TEXT,
created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_notification_channel ON notification_delivery_status(notification_id, channel);
CREATE INDEX idx_status_created ON notification_delivery_status(status, created_at DESC);
Query Optimization:
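-- Bad: unbounded; returns every row for the user and touches every partition
SELECT * FROM notifications WHERE user_id = 'user_123' ORDER BY created_at DESC;
-- Good: the time range allows partition pruning and LIMIT caps the index scan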
SELECT * FROM notifications
WHERE user_id = 'user_123'
AND created_at >= NOW() - INTERVAL '90 days'
ORDER BY created_at DESC
LIMIT 50;
-- Better: Fetch unread count separately
SELECT COUNT(*) FROM notifications
WHERE user_id = 'user_123'
AND read_at IS NULL
AND created_at >= NOW() - INTERVAL '30 days';
Caching Strategy (Redis):
async function getUserNotifications(userId, limit = 50) {
  const cacheKey = `notifications:${userId}:recent`;
  // Try cache first
  const cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached);
  }
  // Cache miss - query database.
  // Compare against a Date, not a millisecond number, to match the TIMESTAMP column.
  const since = new Date(Date.now() - 86400000); // Last 24 hours
  const notifications = await db.notifications
    .where({user_id: userId})
    .where('created_at', '>=', since)
    .orderBy('created_at', 'desc')
    .limit(limit);
  // Cache for 5 minutes
  await redis.setex(cacheKey, 300, JSON.stringify(notifications));
  return notifications;
}
Monitoring & Observability
Track these critical metrics to ensure system health:
Delivery Metrics:
- Delivery Success Rate: Target >99.5%
- Time to Deliver:
- Push: P95 < 5 seconds
- Email: P95 < 30 seconds
- SMS: P95 < 10 seconds
- Retry Rate: Alert if >5% of notifications need retries
- DLQ Size: Alert if >100 notifications in dead letter queue
Infrastructure Metrics:
- Queue Depth: Alert if >10,000 pending notifications
- Consumer Lag: Time between message enqueue and dequeue
- Database Query Time: P95 < 100ms
- Cache Hit Rate: Target >90%
Example Monitoring Dashboard (Prometheus + Grafana):
// Instrument notification service
const prometheusClient = require('prom-client');
const notificationsSent = new prometheusClient.Counter({
name: 'notifications_sent_total',
help: 'Total notifications sent',
labelNames: ['channel', 'status']
});
const notificationDeliveryDuration = new prometheusClient.Histogram({
name: 'notification_delivery_duration_seconds',
help: 'Time to deliver notification',
labelNames: ['channel'],
buckets: [0.1, 0.5, 1, 2, 5, 10, 30, 60] // top bucket sits above the 30s alert threshold so P95 > 30s is measurable
});
async function sendNotification(notification, channel) {
const timer = notificationDeliveryDuration.startTimer({channel});
try {
await channelService[channel].send(notification);
notificationsSent.inc({channel, status: 'success'});
} catch (error) {
notificationsSent.inc({channel, status: 'failure'});
throw error;
} finally {
timer();
}
}
Alert Examples:
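# Prometheus alerting rules
groups:
  - name: notifications
    rules:
      - alert: HighFailureRate
        # Ratio of failed sends to all sends over the last 5 minutes
        expr: |
          sum(rate(notifications_sent_total{status="failure"}[5m]))
            / sum(rate(notifications_sent_total[5m])) > 0.05
        annotations:
          summary: "Notification failure rate above 5%"
      - alert: QueueBacklog
        expr: notification_queue_depth > 10000
        annotations:
          summary: "Notification queue has {{ $value }} pending notifications"
      - alert: SlowDelivery
        # histogram_quantile needs per-bucket rates aggregated by the le label.
        # The histogram must define a bucket above 30s for this alert to fire.
        expr: histogram_quantile(0.95, sum by (le) (rate(notification_delivery_duration_seconds_bucket[5m]))) > 30
        annotations:
          summary: "P95 delivery time is {{ $value }}s (threshold: 30s)"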
Managing all of this infrastructure is complex. MagicBell provides enterprise-grade notification infrastructure with built-in monitoring, guaranteed 99.99% uptime SLA, automatic scaling, and detailed analytics dashboards—so you can focus on building features instead of managing infrastructure.
System Design Interview Preparation
Notification system design is a popular system design interview question. Here's how to approach it, key discussion points, and what interviewers are looking for.
Common Interview Prompt
"Design a notification system that can send 10 million push notifications, 1 million SMS, and 5 million emails per day. The system should support multiple notification types and ensure reliable delivery."
Step 1: Clarify Requirements
Always start by clarifying functional and non-functional requirements with your interviewer.
Functional Requirements
Ask these questions to scope the problem:
- What notification types do we need to support?
  - Transactional (payment confirmations, security alerts)
  - Promotional (marketing campaigns, feature announcements)
  - System alerts (server downtime, scheduled maintenance)
- What delivery channels?
  - Push notifications (iOS, Android, web)
  - Email
  - SMS
  - In-app notifications
- Do we need user preferences?
  - Opt-out mechanisms
  - Channel preferences (email-only, push-only)
  - Quiet hours
  - Notification frequency limits
- Do we need notification history?
  - User can view past notifications
  - Search and filter capabilities
  - Read/unread status
- Any special features?
  - Scheduled notifications
  - Batch notifications (digest emails)
  - Priority levels (urgent, normal, low)
  - Templates and personalization
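Quiet hours are a good preference to sketch in an interview because the window usually wraps past midnight. The example below is a minimal illustration; the `prefs` shape and the rule that urgent notifications bypass quiet hours are assumptions, not requirements stated in the prompt.

```javascript
// True if `hour` (0-23, in the user's local time) falls inside the quiet window.
// The window may wrap past midnight, e.g. start=22, end=7.
function isQuietHour(hour, start, end) {
  if (start === end) return false;                      // empty window: disabled
  if (start < end) return hour >= start && hour < end;  // same-day window
  return hour >= start || hour < end;                   // window wraps midnight
}

// Assumed policy: urgent notifications bypass quiet hours; others are deferred.
function shouldDeliverNow(notification, prefs, localHour) {
  if (notification.priority === 'urgent') return true;
  return !isQuietHour(localHour, prefs.quietStart, prefs.quietEnd);
}
```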
Non-Functional Requirements
- Scale:
  - Daily volume: 10M push, 5M email, 1M SMS
  - Peak traffic: Assume 5x average during peak hours
  - Calculation: 10M push/day ÷ 86,400 sec ≈ 116/sec average, ≈ 580/sec peak
- Reliability:
  - At-least-once delivery: No notifications should be lost
  - Idempotency: Duplicate sends should be handled gracefully
  - SLA: 99.9% uptime, 99.5% delivery success rate
- Latency:
  - Push notifications: Delivered within 5 seconds
  - Emails: Delivered within 30 seconds
  - SMS: Delivered within 10 seconds
- Availability:
  - System should handle partial failures (e.g., email service down but push still works)
  - Graceful degradation during high load
- Cost:
  - SMS is expensive (~$0.01 per message = $10k/day for 1M SMS)
  - Optimize delivery to reduce costs
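The back-of-envelope numbers can be checked with a few lines of code. This sketch assumes traffic is spread evenly across the day and applies the 5x peak multiplier from the requirements; `throughput` is a hypothetical helper.

```javascript
const SECONDS_PER_DAY = 86400;

// Average and peak messages per second for a given daily volume.
function throughput(dailyVolume, peakMultiplier = 5) {
  const average = dailyVolume / SECONDS_PER_DAY;
  return { average, peak: average * peakMultiplier };
}

const daily = { push: 10_000_000, email: 5_000_000, sms: 1_000_000 };
for (const [channel, volume] of Object.entries(daily)) {
  const { average, peak } = throughput(volume);
  console.log(`${channel}: ${average.toFixed(1)}/sec average, ${peak.toFixed(1)}/sec peak`);
}
```

Across all three channels the average is roughly 185 messages per second, which is modest. In an interview it is worth saying so out loud, because it shapes how much sharding and partitioning the design really needs.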
Step 2: High-Level Design
Draw a simple architecture diagram and walk through the flow.
┌─────────────┐
│ Client App  │
│ (Triggers)  │
└──────┬──────┘
       │
       ▼
┌──────────────────┐      ┌─────────────┐
│   API Gateway    │─────▶│    Redis    │ (Rate Limiting)
│ (Load Balancer)  │      └─────────────┘
└──────┬───────────┘
       │
       ▼
┌──────────────────┐      ┌─────────────┐
│   Notification   │─────▶│ PostgreSQL  │ (User Prefs, History)
│     Service      │      └─────────────┘
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│  Message Queue   │ (Kafka / RabbitMQ / SQS)
│  (Partitioned)   │
└──────┬───────────┘
       │
       ├──────────────┬──────────────┬──────────────┐
       ▼              ▼              ▼              ▼
  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
  │   Push   │   │  Email   │   │   SMS    │   │  In-App  │
  │ Processor│   │Processor │   │Processor │   │Processor │
  └────┬─────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘
       │              │              │              │
       ▼              ▼              ▼              ▼
  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌──────────┐
  │  APNs   │    │SendGrid │    │ Twilio  │    │WebSocket │
  │  FCM    │    │ AWS SES │    │         │    │  Server  │
  └─────────┘    └─────────┘    └─────────┘    └──────────┘
Walk through the flow:
- Client triggers notification via API call
- API Gateway authenticates, validates, rate limits
- Notification Service processes business logic, fetches user preferences
- Message published to appropriate queue (push, email, SMS)
- Channel-specific processors consume from queues
- External providers deliver notifications
- Delivery status tracked in database
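The step where the Notification Service applies user preferences and publishes to per-channel queues can be sketched as a pure function. The `prefs.channels` shape and the `notifications.<channel>` queue naming are assumptions made for this example.

```javascript
// Builds one queue message per channel that the user has enabled.
// `prefs.channels` is a hypothetical shape: { push: true, email: false, ... }.
function fanOut(notification, prefs) {
  return notification.channels
    .filter((channel) => prefs.channels[channel] === true)
    .map((channel) => ({
      queue: `notifications.${channel}`, // illustrative queue naming
      payload: {
        id: notification.id,
        userId: notification.userId,
        channel,
        content: notification.content,
      },
    }));
}
```

Keeping this step free of I/O makes it easy to unit test; the caller is then responsible for publishing each returned message to the broker.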
Step 3: Deep Dive - Key Discussion Topics
Topic 1: Why Use Message Queues?
Interviewer asks: "Why not just send notifications synchronously in the API call?"
Your answer:
- Asynchronous Processing: API responds immediately (< 100ms), notification sends in background (may take seconds)
- Traffic Buffering: Handle traffic spikes without overwhelming downstream services
- Reliability: Messages persist in queue until successfully processed (at-least-once delivery)
- Scalability: Easy to add more consumers to process faster
- Retry Logic: Failed messages stay in queue for retries
Example:
Synchronous:
API call → Send push → Send email → Send SMS → Return response
Total time: 500ms + 1000ms + 800ms = 2.3 seconds (bad UX)
Asynchronous:
API call → Enqueue notification → Return 202 Accepted
Total time: 50ms (good UX)
Background workers handle actual delivery
Topic 2: How to Prevent Duplicate Notifications?
Interviewer asks: "What if the same API call is made twice by accident?"
Your answer:
- Idempotency Keys: Client provides unique ID, server deduplicates
POST /api/notifications
{
"idempotency_key": "payment_123_notification",
"user_id": "user_456",
...
}
// Server logic
async function createNotification(request) {
const existing = await redis.get(`idempotency:${request.idempotency_key}`);
if (existing) {
return JSON.parse(existing); // Return cached response
}
const notification = await sendNotification(request);
await redis.setex(
`idempotency:${request.idempotency_key}`,
86400, // Cache for 24 hours
JSON.stringify(notification)
);
return notification;
}
- Message Queue Deduplication: Use message ID to prevent duplicate processing
- Database Constraints: Unique constraint on (user_id, notification_type, external_id)
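Consumer-side deduplication by message ID might look like the following sketch. It is in-memory for illustration only; with several consumers running in parallel, the seen-ID set would need to live in a shared store such as Redis with a TTL. The `Deduplicator` class is hypothetical.

```javascript
// Tracks recently seen message IDs so a redelivered message is processed once.
class Deduplicator {
  constructor(maxEntries = 10000) {
    this.maxEntries = maxEntries;
    this.seen = new Set(); // a Set iterates in insertion order
  }

  // Returns true the first time an ID is seen, false for duplicates.
  markIfNew(messageId) {
    if (this.seen.has(messageId)) return false;
    this.seen.add(messageId);
    if (this.seen.size > this.maxEntries) {
      // Evict the oldest entry to bound memory use
      this.seen.delete(this.seen.values().next().value);
    }
    return true;
  }
}
```

A worker would call `markIfNew(message.id)` before sending and skip the message when it returns `false`. Because old IDs are evicted, this only protects against duplicates that arrive within the retention window.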
Topic 3: How to Handle Third-Party Service Failures?
Interviewer asks: "What if APNs is down? How do we ensure notifications aren't lost?"
Your answer:
- Retry with Exponential Backoff:
  - 1st retry: 1 minute
  - 2nd retry: 5 minutes
  - 3rd retry: 30 minutes
  - After N retries → Dead Letter Queue
- Circuit Breaker Pattern:
  - Detect failures (e.g., 5 consecutive errors)
  - Open circuit (stop sending to APNs for 1 minute)
  - Half-open (try single request)
  - Close circuit if successful
- Fallback to Alternative Providers:
  - Email: Primary: SendGrid → Fallback: AWS SES
  - iOS push: there is no substitute gateway, since services such as AWS SNS deliver through APNs themselves. Fall back to another channel (email, SMS) or hold the message for retry
- Dead Letter Queue (DLQ):
  - Failed notifications after all retries
  - Alert operations team
  - Manual investigation
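The retry schedule and the hand-off to the dead letter queue can be expressed as a small decision function. This is a sketch under the assumption that the worker tracks how many times a message has failed; `nextStep` and the action names are hypothetical.

```javascript
// Delays from the schedule above, in minutes: 1st, 2nd, and 3rd retry.
const RETRY_DELAYS_MIN = [1, 5, 30];

// Decides what to do with a message that has failed `failedAttempts` times
// (the initial send counts as the first attempt).
function nextStep(failedAttempts) {
  if (!Number.isInteger(failedAttempts) || failedAttempts < 1) {
    throw new RangeError('failedAttempts must be a positive integer');
  }
  const retryIndex = failedAttempts - 1;
  if (retryIndex < RETRY_DELAYS_MIN.length) {
    return { action: 'retry', delayMs: RETRY_DELAYS_MIN[retryIndex] * 60 * 1000 };
  }
  return { action: 'dead_letter' };
}
```

Keeping the schedule in one array makes it easy to tune per channel, for example a shorter schedule for time-sensitive push notifications.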
Topic 4: How to Optimize Database Queries?
Interviewer asks: "If a user has 100,000 notifications, how do we efficiently query them?"
Your answer:
- Indexing:
CREATE INDEX idx_user_created ON notifications(user_id, created_at DESC);
- Pagination (Cursor-Based):
SELECT * FROM notifications
WHERE user_id = 'user_123'
AND created_at < '2025-11-20 10:00:00' -- Cursor
ORDER BY created_at DESC
LIMIT 50;
- Partitioning:
-- Monthly partitions
CREATE TABLE notifications_2025_11 PARTITION OF notifications
FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');
- Caching:
  - Cache recent 24 hours in Redis
  - Cache unread count
  - Invalidate cache on new notification
- Archival:
  - Move notifications older than 90 days to cold storage (S3)
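One detail worth raising about cursor-based pagination: a cursor on `created_at` alone can skip rows that share a timestamp. The in-memory sketch below illustrates a compound cursor of `(createdAt, id)`; it assumes numeric timestamps and IDs, and `pageAfter` is a hypothetical helper, not a database API.

```javascript
// Returns one page of notifications older than `cursor`, newest first.
// The cursor is { createdAt, id }; id breaks ties between equal timestamps.
function pageAfter(rows, cursor, limit = 50) {
  const sorted = [...rows].sort(
    (a, b) => b.createdAt - a.createdAt || b.id - a.id
  );
  const remaining = cursor
    ? sorted.filter(
        (r) =>
          r.createdAt < cursor.createdAt ||
          (r.createdAt === cursor.createdAt && r.id < cursor.id)
      )
    : sorted;
  const items = remaining.slice(0, limit);
  const last = items[items.length - 1];
  // A null cursor tells the client there are no more pages.
  const nextCursor =
    remaining.length > limit ? { createdAt: last.createdAt, id: last.id } : null;
  return { items, nextCursor };
}
```

In SQL the same idea is a row comparison such as `(created_at, id) < (:cursor_created_at, :cursor_id)`, backed by an index on `(user_id, created_at DESC, id DESC)`.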
Topic 5: How to Handle Different Time Zones?
Interviewer asks: "How do we send a marketing email at 9 AM user's local time?"
Your answer:
- Store user timezone in database
- Calculate delivery time:
const moment = require('moment-timezone');

// Must be async because it awaits the user lookup
async function getDeliveryTime(userId, targetHour) {
  const user = await getUser(userId);
  const userTimezone = user.timezone; // e.g., 'America/New_York'
  const now = moment().tz(userTimezone);
  // Clone `now` so both values derive from the same instant
  const deliveryTime = now.clone().hour(targetHour).minute(0).second(0).millisecond(0);
  if (deliveryTime.isBefore(now)) {
    deliveryTime.add(1, 'day'); // Schedule for tomorrow
  }
  return deliveryTime.toDate();
}
- Use scheduled notifications queue
- Batch by timezone to reduce queue operations
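Batching by timezone can be as simple as grouping recipients before scheduling, so that one scheduled job covers each zone instead of one per user. A minimal sketch, assuming each user object has `id` and `timezone` fields:

```javascript
// Groups user IDs by IANA timezone name.
function groupByTimezone(users) {
  const groups = new Map();
  for (const user of users) {
    const ids = groups.get(user.timezone) ?? [];
    ids.push(user.id);
    groups.set(user.timezone, ids);
  }
  return groups;
}
```

With a few hundred distinct timezones in practice, this turns millions of per-user schedule entries into a few hundred batch jobs.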
Step 4: Bottlenecks & Trade-offs
Discuss potential bottlenecks and how to address them:
Database Bottleneck:
- Problem: Single database can't handle 100k writes/sec
- Solution: Shard by user_id, use Cassandra/DynamoDB for high write throughput
Message Queue Bottleneck:
- Problem: Single Kafka partition can handle ~10k messages/sec
- Solution: Partition by user_id (10 partitions = 100k messages/sec)
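Partitioning by `user_id` means hashing the key and taking it modulo the partition count, which also keeps each user's notifications in order within one partition. The sketch below uses an FNV-1a-style hash for brevity; Kafka's default Java partitioner uses murmur2, so the partition numbers here will not match Kafka's.

```javascript
// FNV-1a-style 32-bit hash over the string's UTF-16 code units.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep as unsigned 32-bit
  }
  return hash;
}

// The same user always maps to the same partition.
function partitionFor(userId, numPartitions) {
  return fnv1a(userId) % numPartitions;
}
```

The trade-off is that changing the partition count remaps most users, so it is worth choosing a count with headroom up front.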
External Service Rate Limits:
- Problem: APNs allows 2000 connections, each ~2000 notifications/sec
- Solution: Connection pooling, distribute across multiple servers
Cost Optimization:
- Problem: SMS costs $10k/day at scale
- Solution:
- Only send SMS for high-priority notifications
- Fallback to push/email for low-priority
- Batch multiple alerts into single SMS
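The cost argument is easy to quantify. This sketch pairs an illustrative priority-to-channel policy with a cost calculation at the ~$0.01 per SMS figure used above; both function names and the exact policy are assumptions.

```javascript
// Illustrative policy: SMS is reserved for urgent notifications.
function channelsForPriority(priority) {
  switch (priority) {
    case 'urgent': return ['push', 'sms'];
    case 'normal': return ['push', 'email'];
    default:       return ['email'];
  }
}

// Daily SMS cost in dollars. Uses whole cents to avoid floating-point drift.
function dailySmsCostUsd(messagesPerDay, centsPerMessage = 1) {
  return (messagesPerDay * centsPerMessage) / 100;
}
```

If only one in ten notifications is urgent, `dailySmsCostUsd(100_000)` gives $1,000 per day instead of $10,000.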
What Interviewers Look For
✅ Good Answers:
- Clarifies requirements before diving into design
- Discusses trade-offs (consistency vs. availability, cost vs. reliability)
- Considers scalability from the start
- Mentions monitoring and observability
- Asks clarifying questions throughout
❌ Bad Answers:
- Jumps straight into implementation details
- Ignores scale considerations
- Doesn't discuss failure scenarios
- Over-engineers without understanding requirements
For teams actually building notification systems, the complexity goes far beyond interview questions. MagicBell's notification infrastructure solves all these challenges with a simple API, built-in scalability, multi-channel delivery, and comprehensive analytics. Focus your engineering time on your core product, not on notification infrastructure.
Conclusion
Notification system design is both an art and a science. The user-facing experience requires thoughtful UX design, clear information hierarchy, and respect for user preferences. The underlying architecture demands careful engineering: message queues for reliability, design patterns for maintainability, horizontal scaling for performance, and robust monitoring for visibility.
Key Takeaways
User Experience:
- Context matters: Notifications should provide enough information for users to act without switching mental contexts
- Categorization is critical: Transactional, system, account management, and marketing notifications have different user needs
- User control prevents fatigue: Quiet hours, frequency limits, and channel preferences are essential
System Architecture:
- Message queues enable asynchronous processing, traffic buffering, and reliability
- Design patterns (Observer, Factory, Chain of Responsibility, Strategy) create modular, maintainable systems
- Horizontal scaling with stateless services handles growth gracefully
- Rate limiting protects both users (from fatigue) and systems (from overload)
Scalability:
- Database partitioning and caching are essential at scale
- Multi-tier storage (hot/warm/cold) optimizes cost and performance
- Circuit breakers and retry mechanisms ensure resilience
- Monitoring and observability provide visibility into system health
The Build vs. Buy Decision
Building a production-ready notification system requires:
- Infrastructure Engineering: Message queues, databases, caching, load balancing
- Channel Integration: APNs, FCM, SendGrid, Twilio, and many more providers
- Reliability Engineering: Retry logic, circuit breakers, dead letter queues
- Monitoring & Operations: Dashboards, alerts, on-call rotation
- Compliance: GDPR, CAN-SPAM, TCPA, data retention policies
Conservative estimate: 6-12 months for a 3-engineer team to build a production-ready notification system. Ongoing maintenance requires at least 1 full-time engineer.
MagicBell provides all of this infrastructure out-of-the-box:
- Multi-channel delivery: Push, email, SMS, in-app, Slack, Teams, webhooks
- Guaranteed reliability: 99.99% uptime SLA, automatic retries, at-least-once delivery
- Built-in user preferences: Quiet hours, channel selection, frequency limits
- Real-time notifications: WebSocket support for instant in-app updates
- Comprehensive analytics: Delivery rates, open rates, click-through rates
- Developer-friendly API: Send notifications with a single API call
With MagicBell, your team can ship notification features in hours instead of months, letting you focus on your core product instead of notification infrastructure.
Further Reading
Technical Implementation:
- Building a Notification System in Ruby on Rails: Database Design
- Building a React Notification System
User Experience:
- Fighting Back Against Alert Overload
- Mindful Messaging: How Apps Can Make Their Notifications More Meaningful
Curious about the technical implementation? Read our article on building a notification system in Ruby on Rails. Or test our MagicBell playground.
If you want to see notification workflows in action, explore our pre-built templates for GitHub events (pull requests, issues, CI/CD) and Stripe events (payments, subscriptions, disputes).
