Notification system design is the process of building the user experience and technical backend for sending messages across channels like email, push, SMS, and in-app. It covers routing, delivery, user preferences, and scaling to millions of notifications per day.
A good system sends three main types: transactional, promotional, and system alerts. It must stay fast, reliable, and easy to use at scale.
This guide covers notification system design from two sides. First, the user experience. Second, the backend that powers it. Use it to build from scratch, prep for a system design interview, or pick build vs. buy.
User Experience & Design Principles
Rethinking Email Notification System Design
Can anyone else relate to the flood of email notifications in their inbox?
That is because email notifications are still the default notification service in workplace tools.
Our team uses Figma for design. Every time someone posts a comment, I get an email. But without context, it is not useful. And that is just one of six or more work tools I use daily.
Getting notifications from every app in my inbox never felt right. In my inbox, I think broadly. In a specific app, I think locally.
Context is key. When I open my inbox and see items from Figma, Notion, Google Docs, and Slack notification workflows, I have to reset my mind for each one. But in Google Docs, comments sit right next to the text. Acting on them is easy.
Here is another case. If you get several notifications from the same document within minutes, you will likely miss key ones. A bulk notification UI lets users filter and send messages to the right groups. This cuts overload and helps important updates land.
When MagicBell brought me on to help, I studied existing systems. Most of them focused on engagement in apps like Facebook and LinkedIn. In-app systems in work tools were weak. They leaned on email.
The reason is simple. Work tools focus on being work tools, not on notification system design.
As teams spread across devices and time zones, a better notification system matters more. Here is how I started to think about notification design and all its parts.
The Bell & Notification Center Window
The most obvious part of a web notification system is the notification bell and inbox UI. Thanks to social apps, we all know the popover modal or full page where notifications live.
When looking at a single notification, here is a short list of parts we can expect:
- Notification content.
- A time stamp or time since it arrived.
- An icon or thumbnail to show who it is from (person image, brand logo, etc.).
- A visual status: unseen, unread, or read.
- Global actions available on every item: "set importance," "mark as new," "mark as read," "mute," "archive," "delete."

A notification bell and pop-up modal -- see MagicBell's notification inbox for a production example
Social apps and work tools have different goals. Engagement is not the point at work. A work notification system helps users sort info fast, act on it, and let fewer items slip through.
To think about notification system design in a clear way, we group notifications like this:
Notification Type Taxonomy
Incoming Request or Message Notification
For work, most emails or incoming messages boil down to some type of request.
This could be an email, an SMS notification, a text via Slack or other apps, or a social media ping. The system treats each one as a notification request. It then formats and sends it through a channel like email, SMS, or push.
These can include responses or comments tied to the request. They sit apart from the thread itself since the recipient owns the action.
Notifications for Action Taken
When someone acts on a thread, the system needs to notify all users who follow it. Each type has its own actions plus global ones. For example, users can stop getting updates on that thread.
System Notifications
Workplace software is part of daily work. Updates about changes or outages matter to the end user. But only a few are urgent when they arrive. We designed system notifications to be visible but let users control when they see them.
Account Management Notifications
These cover admin-driven changes. They relate to the user account and to areas like billing. Like system notifications, most are just informative. They rarely reach the top of the pile.
Marketing Notifications
These cover product updates and what is new. They are the least tied to daily work. How much users interact with them is up to them.
Acting on Notifications
We focused on showing the most info while letting users decide fast whether to act.
Here is more about the actions available after getting each notification type:
Communication Notifications, Push Notifications & Actions
- View in context.
- Quick reply.
- Reminder ("at a scheduled time," "snooze," or even logic like "when X happens, do Y").
- Delegate.
- Label (to organize).
- Set status (to show action or share changes with team members).
For example, a long-form notification has a button to view the full message. A comment notification lets you jump into the chat to reply.
All action notifications also benefit from a snooze, reminder, status, label, or delegate option.
System, Account Management, & Marketing
These rarely need action. They are less important to the user's day. But when action is needed, they should rise to the top.
System Architecture Components
Building a production-ready notification system takes careful planning. The infrastructure must handle high volumes, stay reliable, and scale well. Here is how the parts fit together.
Core Architecture Overview
A strong notification system has several connected layers:
- API Gateway / Notification Service - Entry point for all notification requests.
- Message Queue System - Buffers requests and enables async processing.
- Notification Processor - Handles business logic, user preferences, and routing.
- Channel Services - Delivery services for each notification channel.
- External Delivery Providers - Third-party services (APNs, FCM, SMTP, Twilio).
- Storage & Tracking - Database for notification history and delivery status.
- Retry & Dead Letter Queue - Handles failed deliveries.

1. Notification Service (API Gateway)
This is the entry point for all notification requests. It checks incoming data, verifies API calls, and routes notifications to the right channels. Think of it as the router. It decides if a notification goes via email, push, SMS, or in-app.
Key Responsibilities:
- Request validation and authentication.
- Rate limiting to prevent abuse.
- Routing logic based on notification type.
- Duplicate checks.
Example API Request:
POST /api/notifications
{
  "user_id": "user_123",
  "notification_type": "transactional",
  "channels": ["push", "email"],
  "priority": "high",
  "content": {
    "title": "Payment Received",
    "body": "Your payment of $99 was processed successfully."
  },
  "metadata": {
    "transaction_id": "txn_456"
  }
}
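Here is a minimal sketch of the validation step the gateway might run on a payload like the one above. The function name `validateRequest`, the allowed channel list, and the priority values are assumptions for illustration. They are not part of any specific API.

```typescript
// Hypothetical allow-lists; adjust to the channels and priorities your system supports.
const CHANNELS: readonly string[] = ["push", "email", "sms", "in_app"];
const PRIORITIES: readonly string[] = ["low", "normal", "high", "urgent"];

// Returns a list of human-readable problems. An empty list means the payload is valid.
function validateRequest(input: unknown): string[] {
  if (typeof input !== "object" || input === null) {
    return ["body must be a JSON object"];
  }
  const req = input as Record<string, unknown>;
  const errors: string[] = [];

  if (typeof req.user_id !== "string" || req.user_id.length === 0) {
    errors.push("user_id is required");
  }
  if (typeof req.notification_type !== "string") {
    errors.push("notification_type is required");
  }
  if (!Array.isArray(req.channels) || req.channels.length === 0) {
    errors.push("channels must be a non-empty array");
  } else {
    for (const channel of req.channels) {
      if (typeof channel !== "string" || !CHANNELS.includes(channel)) {
        errors.push(`unknown channel: ${String(channel)}`);
      }
    }
  }
  if (typeof req.priority !== "string" || !PRIORITIES.includes(req.priority)) {
    errors.push("priority is invalid");
  }
  const content = req.content as Record<string, unknown> | undefined;
  if (!content || typeof content.title !== "string" || typeof content.body !== "string") {
    errors.push("content.title and content.body are required");
  }
  return errors;
}
```

A gateway would typically return 400 with this list when it is non-empty, and enqueue the request otherwise.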
2. Message Queue System
Message queues are critical for high volumes and reliability. They split creation from delivery. This lets the system handle traffic spikes without breaking.
Popular Choices:
- Apache Kafka: Best for high-throughput event streaming (100k+ messages/sec).
- RabbitMQ: Great for complex routing logic and message priority.
- AWS SQS: Managed solution with built-in retry and dead letter queues.
- Redis Streams: Lightweight option for moderate volumes, with consumer groups for sharing work across workers.
Why Message Queues Are Essential:
- Async Processing: The API responds right away. Notifications process in the background.
- Traffic Buffering: Handles spikes without overloading other services.
- Reliability: Messages persist until processed (at-least-once delivery).
- Scalability: Add more consumers to process messages faster.
Queue Design Pattern:
API Gateway → [Priority Queue] → High Priority Processor
            → [Standard Queue] → Standard Processor
            → [Bulk Queue]     → Batch Processor
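The queue layout above implies a routing decision at the gateway. A minimal sketch of that decision follows. The `recipientCount` field and the `BULK_THRESHOLD` value are assumptions made up for this example.

```typescript
type Priority = "urgent" | "high" | "normal" | "low";
type QueueName = "priority" | "standard" | "bulk";

interface QueuedNotification {
  priority: Priority;
  recipientCount: number; // Hypothetical field: how many users this request fans out to
}

// Hypothetical cut-off above which a send is treated as a bulk campaign.
const BULK_THRESHOLD = 1000;

// Urgent and high-priority traffic always takes the fast lane.
// Large fan-outs go to the bulk queue so they cannot starve everyday traffic.
function pickQueue(notification: QueuedNotification): QueueName {
  if (notification.priority === "urgent" || notification.priority === "high") {
    return "priority";
  }
  if (notification.recipientCount >= BULK_THRESHOLD) {
    return "bulk";
  }
  return "standard";
}
```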
3. Notification Processor (Business Logic Layer)
This is the brain of the system. It:
- Gets user preferences.
- Picks channels based on opt-outs, quiet hours, and settings.
- Applies templates and fills in user data.
- Enforces rate limits per user and channel.
- Sends to the right Channel Services.
User Preference Checks:
async function processNotification(notification) {
  const user = await userPreferenceService.getPreferences(notification.user_id);

  // Respect quiet hours: defer everything except urgent notifications
  if (isQuietHours(user.timezone, user.quiet_hours) && notification.priority !== 'urgent') {
    await scheduleForLater(notification, user.quiet_hours.end);
    return;
  }

  // Filter channels based on user opt-outs
  const allowedChannels = notification.channels.filter(channel =>
    user.enabled_channels.includes(channel)
  );
  if (allowedChannels.length === 0) {
    return; // User has opted out of every requested channel
  }

  // Check rate limits
  if (await isRateLimited(notification.user_id, allowedChannels)) {
    await queueForBatch(notification);
    return;
  }

  // Route to channel services
  for (const channel of allowedChannels) {
    await channelServices[channel].send(notification);
  }
}
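The processor above calls an `isQuietHours` helper without defining it. One possible sketch follows. It assumes quiet hours are stored as "HH:MM" strings, as in the preferences schema in this guide, and it uses the built-in `Intl` API for timezone conversion. The optional `at` parameter is an addition to make the function testable. The important detail is that a window like 22:00 to 08:00 wraps past midnight.

```typescript
interface QuietHours {
  start: string; // "HH:MM", 24-hour clock
  end: string;   // "HH:MM", 24-hour clock
}

// Convert "HH:MM" to minutes after midnight.
function toMinutes(hhmm: string): number {
  const [hours, minutes] = hhmm.split(":").map(Number);
  return hours * 60 + minutes;
}

// Pure check: is nowMinutes inside [start, end)? Handles windows that wrap past midnight.
function isWithinWindow(nowMinutes: number, quiet: QuietHours): boolean {
  const start = toMinutes(quiet.start);
  const end = toMinutes(quiet.end);
  if (start === end) return false; // Zero-length window: treat as disabled
  return start < end
    ? nowMinutes >= start && nowMinutes < end
    : nowMinutes >= start || nowMinutes < end; // Window wraps midnight
}

// Wall-clock minutes after midnight in an IANA timezone such as "Europe/Berlin".
function minutesInTimezone(timezone: string, at: Date): number {
  const parts = new Intl.DateTimeFormat("en-GB", {
    timeZone: timezone,
    hour: "2-digit",
    minute: "2-digit",
    hour12: false,
  }).formatToParts(at);
  const get = (type: string): number =>
    Number(parts.find((part) => part.type === type)?.value ?? 0);
  // Some engines report midnight as hour 24, so normalize with % 24.
  return (get("hour") % 24) * 60 + get("minute");
}

function isQuietHours(timezone: string, quiet: QuietHours, at: Date = new Date()): boolean {
  return isWithinWindow(minutesInTimezone(timezone, at), quiet);
}
```

Keeping the window check pure makes the midnight edge case easy to unit test without touching the system clock.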
4. Channel Services (Delivery Layer)
Each channel gets its own service. Channels differ in how they deliver and how they fail.
Push Notification Service
A mobile push notification service works with:
- APNs (Apple Push Notification Service): For iOS devices.
- FCM (Firebase Cloud Messaging): For Android devices and web push.
- Web Push Protocol: For browser notifications.
Key Challenges:
- Token management (tokens expire, users uninstall apps).
- Different payload formats per platform.
- Key rotation for APNs.
- Silent vs. visible notifications.
Email Service
Connects with SMTP providers or email APIs:
- SendGrid, Mailgun, AWS SES, Postmark.
Key Considerations:
- HTML vs. plain text.
- Spam score tuning.
- Bounce and complaint handling.
- Unsubscribe link rules (CAN-SPAM, GDPR).
SMS Service
Routes through SMS gateways:
- Twilio, AWS SNS, MessageBird.
Key Challenges:
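- Length limits (160 GSM-7 characters per SMS segment, or 70 with Unicode; longer messages are split into multiple billed segments, and some providers, such as Twilio, cap a single message body at 1600 characters).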
- Cross-border delivery and carrier issues.
- Cost control (SMS is pricey at scale).
- Shortcode vs. long code vs. toll-free.
In-App Notification Service
Sends real-time notifications inside the app:
- WebSockets for real-time delivery.
- Server-Sent Events (SSE) for one-way streaming.
- Long polling as fallback.
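As a small illustration of the SSE option, here is a sketch of how a server could serialize one notification into an SSE frame. The `InAppNotification` shape and the `notification` event name are assumptions for this example. The wire format itself (`id:`, `event:`, and `data:` lines ended by a blank line) comes from the Server-Sent Events standard.

```typescript
interface InAppNotification {
  id: string; // Must not contain line breaks
  title: string;
  body: string;
}

// Serialize one notification as a Server-Sent Events frame.
// Each field is "name: value" on its own line. A blank line ends the event.
function toSseFrame(notification: InAppNotification, eventName = "notification"): string {
  // JSON.stringify escapes line breaks, so the payload always fits on one data line.
  const data = JSON.stringify({ title: notification.title, body: notification.body });
  return `id: ${notification.id}\nevent: ${eventName}\ndata: ${data}\n\n`;
}
```

Sending the `id` field lets a browser that reconnects report the last event it saw through the `Last-Event-ID` header, so the server can replay anything missed.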
5. User Preferences Service
This stores and retrieves user notification preferences. It is key for respecting user choices and preventing fatigue.
Stored Preferences:
- Enabled channels (email, push, SMS, in-app).
- Quiet hours and timezone.
- Limits (max per hour/day).
- Category subscriptions (marketing, transactional, social).
- Language and locale.
Database Schema Example:
CREATE TABLE user_preferences (
  user_id VARCHAR(255) PRIMARY KEY,
  enabled_channels JSONB DEFAULT '["email", "push"]',
  quiet_hours JSONB DEFAULT '{"start": "22:00", "end": "08:00"}',
  timezone VARCHAR(50) DEFAULT 'UTC',
  max_notifications_per_hour INTEGER DEFAULT 10,
  subscribed_categories JSONB DEFAULT '["transactional"]',
  locale VARCHAR(10) DEFAULT 'en-US',
  updated_at TIMESTAMP DEFAULT NOW()
);
6. Notification Tracker & Analytics
This logs delivery status and tracks how well notifications perform.
Tracked Metrics:
- Sent: Handed off to the delivery provider.
- Delivered: Provider confirms it arrived (push reached device, email accepted).
- Opened: User opened the notification or email.
- Clicked: User clicked a link in it.
- Failed: Delivery failed (bad device token, bounced email, etc.).
Database Schema Example:
CREATE TABLE notification_logs (
  id SERIAL PRIMARY KEY,
  notification_id VARCHAR(255) NOT NULL,
  user_id VARCHAR(255) NOT NULL,
  channel VARCHAR(50) NOT NULL,
  status VARCHAR(50) NOT NULL,
  sent_at TIMESTAMP,
  delivered_at TIMESTAMP,
  opened_at TIMESTAMP,
  clicked_at TIMESTAMP,
  failed_at TIMESTAMP,
  failure_reason TEXT,
  metadata JSONB,
  -- One row per notification per channel, so uniqueness spans both columns
  UNIQUE (notification_id, channel)
);
CREATE INDEX idx_user_status ON notification_logs(user_id, status);
CREATE INDEX idx_sent_at ON notification_logs(sent_at DESC);
7. Retry Mechanism & Dead Letter Queue
This handles failed deliveries with smart retry logic.
Retry Strategy (Exponential Backoff):
- 1st retry: After 1 minute.
- 2nd retry: After 5 minutes.
- 3rd retry: After 30 minutes.
- 4th retry: After 2 hours.
- After 4 failed attempts: Move to Dead Letter Queue.
Dead Letter Queue (DLQ):
Notifications that exhaust all retries go to a DLQ for:
- Manual review.
- Alerting the ops team.
- Finding patterns (e.g., all emails to one domain bounce).
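The retry schedule and DLQ hand-off above can be sketched as a small lookup. The names `nextStep` and `withJitter` are made up for this example. Jitter is a common addition that the schedule above does not specify; it spreads retries out so that many failed sends do not all retry at the same instant.

```typescript
// Delays from the schedule above, in minutes: 1st, 2nd, 3rd, and 4th retry.
const RETRY_DELAYS_MINUTES = [1, 5, 30, 120];

type RetryDecision =
  | { action: "retry"; delayMs: number }
  | { action: "dead_letter" };

// retriesUsed is the number of retries already attempted (0 right after the first send fails).
function nextStep(retriesUsed: number): RetryDecision {
  if (retriesUsed < 0 || retriesUsed >= RETRY_DELAYS_MINUTES.length) {
    return { action: "dead_letter" }; // All retries exhausted
  }
  return { action: "retry", delayMs: RETRY_DELAYS_MINUTES[retriesUsed] * 60000 };
}

// Optional: scale the delay by a random factor in [0.5, 1.0).
// The random source is injectable so the function can be tested.
function withJitter(delayMs: number, random: () => number = Math.random): number {
  return Math.round(delayMs * (0.5 + random() * 0.5));
}
```

A worker would call `nextStep` after each failure, then either re-enqueue the message with the returned delay or publish it to the DLQ.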
8. Notification Template Repository
This stores templates for each notification type. Teams keep messages consistent and update content fast.
Template Structure:
{
  "template_id": "payment_received",
  "channels": {
    "email": {
      "subject": "Payment Received - {{amount}}",
      "html_body": "<html>...</html>",
      "text_body": "Your payment of {{amount}} was processed..."
    },
    "push": {
      "title": "Payment Received",
      "body": "Your payment of {{amount}} was successful."
    },
    "sms": {
      "body": "Payment received: {{amount}}. Transaction ID: {{transaction_id}}"
    }
  },
  "variables": ["amount", "transaction_id", "timestamp"]
}
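To show how the `{{variable}}` placeholders in a template like this might be filled in, here is a minimal renderer. It is a sketch, not a full template engine. The function name `renderTemplate` is an assumption, and it does no HTML escaping, which a real `html_body` renderer would need.

```typescript
// Replace each {{name}} placeholder with the matching value.
// Throws when a placeholder has no value, so a half-filled message is never sent.
function renderTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match: string, name: string) => {
    const value = values[name];
    if (value === undefined) {
      throw new Error(`Missing template variable: ${name}`);
    }
    return value;
  });
}
```

For example, rendering the SMS body above with `{ amount: "$99", transaction_id: "txn_456" }` fills both placeholders, while omitting `transaction_id` raises an error instead of sending a broken message.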
9. Scheduled Notifications
This manages notifications set for a specific time. Use cases include:
- Reminders (meeting in 15 minutes).
- Time-zone-aware notifications (send at 9 AM user's local time).
- Scheduled campaigns (product launch announcement).
- Digest emails (daily summary of activity).
Implementation:
- Option 1: Database polling (check every minute for due notifications).
- Option 2: Delayed message queues (Redis with ZADD + score as timestamp).
- Option 3: Cron-based scheduler (Kubernetes CronJobs, AWS EventBridge).
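Option 2 is easier to picture with code. The sketch below is an in-memory stand-in for a Redis sorted set where the score is the due time. The class and method names are made up for this example. A real deployment would use Redis or another durable store so scheduled items survive a restart.

```typescript
interface Scheduled<T> {
  dueAt: number; // Epoch milliseconds; plays the role of the sorted-set score
  payload: T;
}

class DelayedQueue<T> {
  private items: Scheduled<T>[] = [];

  // Insert while keeping the array sorted by dueAt (like ZADD with a timestamp score).
  schedule(payload: T, dueAt: number): void {
    const index = this.items.findIndex((item) => item.dueAt > dueAt);
    const entry: Scheduled<T> = { dueAt, payload };
    if (index === -1) {
      this.items.push(entry);
    } else {
      this.items.splice(index, 0, entry);
    }
  }

  // Remove and return every payload due at or before now, oldest first.
  popDue(now: number = Date.now()): T[] {
    const firstFuture = this.items.findIndex((item) => item.dueAt > now);
    const due = firstFuture === -1 ? this.items.splice(0) : this.items.splice(0, firstFuture);
    return due.map((item) => item.payload);
  }
}
```

A worker would call `popDue()` on a short interval and hand each result to the notification processor.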
How Components Work Together
In a typical notification flow:
- Business service (e.g., payment processor) calls the Notification API.
- API Gateway validates the request, assigns a unique notification ID, and returns 202 Accepted.
- Message Queue receives the notification event.
- Notification Processor consumes from the queue:
- Fetches user preferences.
- Checks rate limits.
- Applies the notification template.
- Picks delivery channels.
- Channel Services send to external providers (APNs, FCM, SendGrid, Twilio).
- Notification Tracker logs delivery status.
- Retry Mechanism handles failures.
- Analytics Dashboard shows real-time delivery metrics.
For teams building notification systems, this setup provides reliability and scale. But building it all is complex and slow. MagicBell's notification infrastructure handles these components out-of-the-box. Focus on your core product instead of notification plumbing.
Essential Design Patterns
Good notification systems use proven design patterns. These keep the code modular, easy to maintain, and ready to scale. Here are the four core patterns.
1. Observer Pattern (Publish/Subscribe)
This is ideal for event-driven notifications. When an event occurs (comment posted, payment processed), subscribers get notified on their own.
How It Works:
- Publishers emit events without knowing who will consume them.
- Subscribers register interest in specific event types.
- Event Bus routes events to the right subscribers.
Example: GitHub Comment Notifications
// Event Publisher
class CommentService {
async createComment(issueId, userId, content) {
const comment = await db.comments.create({issueId, userId, content});
// Publish event - doesn't know who's listening
eventBus.publish('comment.created', {
issueId,
commentId: comment.id,
authorId: userId,
content
});
return comment;
}
}
// Event Subscribers
eventBus.subscribe('comment.created', async (event) => {
// Notify issue author
const issue = await db.issues.findById(event.issueId);
await notificationService.send({
userId: issue.authorId,
type: 'comment_on_your_issue',
data: event
});
});
eventBus.subscribe('comment.created', async (event) => {
// Notify @mentioned users
const mentions = extractMentions(event.content);
for (const userId of mentions) {
await notificationService.send({
userId,
type: 'mentioned_in_comment',
data: event
});
}
});
eventBus.subscribe('comment.created', async (event) => {
// Notify thread participants
const participants = await db.comments
.where({issueId: event.issueId})
.distinct('userId');
for (const userId of participants) {
if (userId !== event.authorId) {
await notificationService.send({
userId,
type: 'activity_on_subscribed_thread',
data: event
});
}
}
});
Benefits:
- Decoupling: Publishers do not need to know about notification logic.
- Extensibility: Add new subscribers without changing publishers.
- Testability: Test each subscriber on its own.
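The example above uses an `eventBus` object without showing it. A minimal in-process version could look like the sketch below. This is an assumption about its shape, not a specific library. In production the same role is often played by a message broker.

```typescript
type Handler<T> = (event: T) => void | Promise<void>;

class EventBus {
  private handlers = new Map<string, Handler<unknown>[]>();

  // Register a handler for one event type.
  subscribe<T>(eventType: string, handler: Handler<T>): void {
    const list = this.handlers.get(eventType) ?? [];
    list.push(handler as Handler<unknown>);
    this.handlers.set(eventType, list);
  }

  // Run every subscriber for the event type and resolve with the number that failed.
  // A failing subscriber does not stop the others from running.
  async publish<T>(eventType: string, event: T): Promise<number> {
    const list = this.handlers.get(eventType) ?? [];
    const outcomes = await Promise.all(
      list.map(async (handler) => {
        try {
          await handler(event);
          return true;
        } catch {
          return false;
        }
      })
    );
    return outcomes.filter((succeeded) => !succeeded).length;
  }
}
```

Isolating failures matters here: a bug in the mentions subscriber should not prevent the issue author from being notified.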
2. Factory Method Pattern
This lets you create different notification types based on context. It works well when each category has its own format and delivery needs.
Example: Multi-Channel Notification Factory
interface Notification {
send(): Promise<void>;
}
interface NotificationFactory {
createNotification(type: string, data: any): Notification;
}
class PushNotification implements Notification {
constructor(private userId: string, private title: string, private body: string) {}
async send() {
const deviceTokens = await getDeviceTokens(this.userId);
await fcm.sendMulticast({
tokens: deviceTokens,
notification: {
title: this.title,
body: this.body
}
});
}
}
class EmailNotification implements Notification {
constructor(private userId: string, private subject: string, private htmlBody: string) {}
async send() {
const user = await getUser(this.userId);
await emailService.send({
to: user.email,
subject: this.subject,
html: this.htmlBody
});
}
}
class SMSNotification implements Notification {
constructor(private userId: string, private message: string) {}
async send() {
const user = await getUser(this.userId);
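await twilioClient.messages.create({
to: user.phoneNumber,
from: process.env.TWILIO_FROM_NUMBER, // Assumed env var name; Twilio needs a sender (from or messagingServiceSid)
body: this.message
});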
}
}
class NotificationFactoryImpl implements NotificationFactory {
createNotification(type: string, data: any): Notification {
switch(type) {
case 'push':
return new PushNotification(data.userId, data.title, data.body);
case 'email':
return new EmailNotification(data.userId, data.subject, data.htmlBody);
case 'sms':
return new SMSNotification(data.userId, data.message);
default:
throw new Error(`Unknown notification type: ${type}`);
}
}
}
// Usage
const factory = new NotificationFactoryImpl();
const notification = factory.createNotification('push', {
userId: 'user_123',
title: 'New Message',
body: 'You have a new message from Alice'
});
await notification.send();
Benefits:
- Single Responsibility: Each notification class handles one channel.
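- Open/Closed Principle: A new channel needs only a new class and one extra case in the factory. Existing notification classes stay untouched.
- Type Safety: Each class declares typed constructor parameters. Note that the `data: any` argument in this example is not checked at compile time; a per-type data interface would close that gap.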
3. Chain of Responsibility Pattern
This routes notifications through a chain of handlers. Each one decides to process, modify, or pass it along. It is great for filters, validation, and priority-based delivery.
Example: Notification Processing Pipeline
interface NotificationHandler {
setNext(handler: NotificationHandler): NotificationHandler;
handle(notification: Notification): Promise<boolean>;
}
abstract class AbstractNotificationHandler implements NotificationHandler {
private nextHandler: NotificationHandler | null = null;
setNext(handler: NotificationHandler): NotificationHandler {
this.nextHandler = handler;
return handler;
}
async handle(notification: Notification): Promise<boolean> {
if (this.nextHandler) {
return this.nextHandler.handle(notification);
}
return true;
}
}
class UserPreferenceHandler extends AbstractNotificationHandler {
async handle(notification: Notification): Promise<boolean> {
const prefs = await getUserPreferences(notification.userId);
// Filter out disabled channels
notification.channels = notification.channels.filter(channel =>
prefs.enabledChannels.includes(channel)
);
if (notification.channels.length === 0) {
console.log('All channels disabled for user');
return false; // Stop processing
}
return super.handle(notification);
}
}
class QuietHoursHandler extends AbstractNotificationHandler {
async handle(notification: Notification): Promise<boolean> {
if (notification.priority === 'urgent') {
return super.handle(notification); // Skip quiet hours for urgent
}
const prefs = await getUserPreferences(notification.userId);
const userTime = getCurrentTimeInTimezone(prefs.timezone);
if (isWithinQuietHours(userTime, prefs.quietHours)) {
await scheduleForLater(notification, prefs.quietHours.end);
return false; // Stop processing, scheduled for later
}
return super.handle(notification);
}
}
class RateLimitHandler extends AbstractNotificationHandler {
async handle(notification: Notification): Promise<boolean> {
const count = await getNotificationCount(
notification.userId,
Date.now() - 3600000 // Last hour
);
const prefs = await getUserPreferences(notification.userId);
if (count >= prefs.maxNotificationsPerHour) {
if (notification.priority === 'urgent') {
return super.handle(notification); // Bypass rate limit for urgent
}
await queueForBatch(notification);
return false; // Stop processing, will send in batch
}
return super.handle(notification);
}
}
class DeliveryHandler extends AbstractNotificationHandler {
async handle(notification: Notification): Promise<boolean> {
for (const channel of notification.channels) {
await channelService[channel].send(notification);
}
return super.handle(notification);
}
}
// Build the chain
const chain = new UserPreferenceHandler();
chain
.setNext(new QuietHoursHandler())
.setNext(new RateLimitHandler())
.setNext(new DeliveryHandler());
// Process notification through chain
await chain.handle(notification);
Benefits:
- Modularity: Each handler has a single job.
- Flexibility: Add, remove, or reorder handlers with ease.
- Testability: Test each handler on its own.
4. Strategy Pattern
This lets you switch delivery methods at runtime. Use it for modes like immediate, batched, or scheduled.
Example: Delivery Strategy Pattern
interface DeliveryStrategy {
deliver(notification: Notification): Promise<void>;
}
class ImmediateDeliveryStrategy implements DeliveryStrategy {
async deliver(notification: Notification): Promise<void> {
// Send immediately
for (const channel of notification.channels) {
await channelService[channel].send(notification);
}
}
}
class BatchedDeliveryStrategy implements DeliveryStrategy {
async deliver(notification: Notification): Promise<void> {
// Add to batch queue
await batchQueue.add(notification);
// Batch processor runs every 15 minutes
// and sends digest emails/notifications
}
}
class ScheduledDeliveryStrategy implements DeliveryStrategy {
constructor(private deliveryTime: Date) {}
async deliver(notification: Notification): Promise<void> {
await scheduleQueue.add(notification, {
delay: this.deliveryTime.getTime() - Date.now()
});
}
}
class TimeZoneAwareDeliveryStrategy implements DeliveryStrategy {
constructor(private targetHour: number) {}
async deliver(notification: Notification): Promise<void> {
const user = await getUser(notification.userId);
const deliveryTime = getNextOccurrenceOfHour(this.targetHour, user.timezone);
await new ScheduledDeliveryStrategy(deliveryTime).deliver(notification);
}
}
class NotificationService {
async send(notification: Notification) {
let strategy: DeliveryStrategy;
// Choose strategy based on notification type and priority
if (notification.priority === 'urgent') {
strategy = new ImmediateDeliveryStrategy();
} else if (notification.category === 'digest') {
strategy = new BatchedDeliveryStrategy();
} else if (notification.deliveryTime) {
strategy = new ScheduledDeliveryStrategy(notification.deliveryTime);
} else if (notification.category === 'marketing') {
strategy = new TimeZoneAwareDeliveryStrategy(9); // 9 AM user's time
} else {
strategy = new ImmediateDeliveryStrategy();
}
await strategy.deliver(notification);
}
}
Benefits:
- Flexibility: Switch delivery behavior at runtime.
- Maintainability: Each strategy is independent and easy to change.
- Extensibility: Add new strategies without changing existing code.
Building these patterns from scratch takes a lot of work. A managed notification service has them built in with a clean API. Send notifications with one API call and use a battle-tested backend.
Scalability & Performance in Notification System Design
Scaling a notification system to millions of deliveries per day takes planning across infrastructure, databases, rate limiting, and monitoring.
Handling High Throughput in Notification System Design
A production notification system must handle bursty traffic without dropping messages. Here are the throughput numbers to plan around.
Scale Requirements:
- 10 million push notifications per day = ~116 notifications/second average, ~500/sec peak.
- 5 million emails per day = ~58 emails/second average, ~250/sec peak.
- 1 million SMS per day = ~12 SMS/second average, ~50/sec peak.
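The averages above come from dividing the daily volume by the 86,400 seconds in a day. For example, 10 million per day works out to about 115.7 per second. The sketch below does that arithmetic. The peak multiplier is an assumption you should replace with your own traffic data; the peak figures above correspond to a factor a little above 4.

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60; // 86,400

// Average per-second rate for a daily volume, rounded to the nearest whole message.
function averagePerSecond(perDay: number): number {
  return Math.round(perDay / SECONDS_PER_DAY);
}

// Peak estimate, rounded up, using a caller-supplied peak-to-average multiplier.
function peakPerSecond(perDay: number, peakFactor: number): number {
  return Math.ceil((perDay / SECONDS_PER_DAY) * peakFactor);
}
```

Size consumers and provider connections for the peak figure, not the average, since queues only absorb short bursts.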
Scalability Strategies:
1. Horizontal Scaling (Stateless Services)
All notification services should be stateless. Keep state in external stores:
- Redis: User preferences cache, rate limit counters, device tokens.
- PostgreSQL/MySQL: Notification history, delivery status, user data.
- Message Queue: Pending notifications, retry queue.
This lets you scale by simply adding more instances:
# Kubernetes deployment example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: notification-processor
spec:
  replicas: 10 # Scale to 10 instances
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: notification-processor
  template:
    metadata:
      labels:
        app: notification-processor
    spec:
      containers:
        - name: processor
          image: notification-processor:latest
          env:
            - name: REDIS_URL
              value: redis://cache:6379
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
2. Load Balancing
Spread traffic across multiple instances:
- API Gateway: NGINX, AWS ALB, Cloudflare.
- Message Queue Consumers: Multiple workers consuming from the same queue.
- Database: Read replicas for queries, primary for writes.
3. Partitioning Strategies
User-Based Sharding:
// Route notifications to specific queue based on user ID
function getQueueForUser(userId) {
  const hash = hashCode(userId);
  // Math.abs guards against hash functions that return negative integers
  const queueIndex = Math.abs(hash) % NUM_QUEUES;
  return `notifications_queue_${queueIndex}`;
}
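The sharding function relies on a `hashCode` helper that is not shown. One possible implementation is a 32-bit FNV-1a hash, sketched below. This is a swapped-in choice for illustration, not necessarily what the original code used. It hashes UTF-16 code units, which matches byte-wise FNV-1a only for ASCII input such as typical user IDs.

```typescript
// 32-bit FNV-1a hash of a string. Always returns a non-negative integer.
function hashCode(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // Multiply by the FNV prime with 32-bit wraparound
  }
  return hash >>> 0; // Reinterpret as unsigned
}
```

Because the result is deterministic, the same user always lands on the same queue, which keeps that user's notifications in order. Changing the number of queues remaps most users, so plan queue counts ahead or use consistent hashing.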
Channel-Based Isolation:
[Push Queue] → Push Processor (high priority, fast)
[Email Queue] → Email Processor (medium priority)
[SMS Queue] → SMS Processor (low volume, expensive)
Geographic Distribution:
- US East queue: US data center.
- EU queue: EU data center (GDPR compliance).
- APAC queue: Asia data center.
Rate Limiting Strategies
Rate limiting is critical for notification system design at scale. It prevents notification fatigue and protects external services from overload.
Per-User Rate Limits
In any notification system design, per-user rate limits keep individual users from being overwhelmed by too many messages.
Sliding Window Log (Redis Sorted Set):
async function checkUserRateLimit(userId, maxPerHour = 10) {
  const key = `rate_limit:user:${userId}`;
  const now = Date.now();
  const hourAgo = now - 3600000;

  // Remove entries older than one hour
  await redis.zremrangebyscore(key, 0, hourAgo);

  // Count notifications sent in the last hour
  const count = await redis.zcard(key);
  if (count >= maxPerHour) {
    return false; // Rate limited
  }

  // Record the current notification (score = timestamp)
  // Note: these steps are not atomic. Concurrent workers can slightly exceed
  // the limit unless the whole sequence runs in a Lua script.
  await redis.zadd(key, now, `${now}-${Math.random()}`);
  await redis.expire(key, 3600);
  return true; // Allowed
}
Token Bucket Algorithm:
This allows bursts while keeping the average rate steady.
class TokenBucket {
  constructor(capacity, refillRate) {
    this.capacity = capacity; // Max tokens
    this.tokens = capacity;
    this.refillRate = refillRate; // Tokens per second
    this.lastRefill = Date.now();
  }

  async consume(tokens = 1) {
    this.refill();
    if (this.tokens >= tokens) {
      this.tokens -= tokens;
      return true; // Allowed
    }
    return false; // Rate limited
  }

  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    const tokensToAdd = elapsed * this.refillRate;
    this.tokens = Math.min(this.capacity, this.tokens + tokensToAdd);
    this.lastRefill = now;
  }
}
Per-System Rate Limits
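Guard external services from overload. Treat these as rough planning figures and confirm current limits with each provider:
- APNs: Roughly 2000/sec per connection (pool them).
- FCM: Per-project quotas apply; back off on 429 and 5xx errors.
- SendGrid: 100-3000 emails/sec (varies by plan).
- Twilio: 1-100 SMS/sec (varies by account).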
Circuit Breaker Pattern:
class CircuitBreaker {
  constructor(threshold, timeout) {
    this.failureCount = 0;
    this.threshold = threshold; // Open circuit after N failures
    this.timeout = timeout; // Try again after X ms
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = Date.now();
  }

  async execute(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN');
      }
      this.state = 'HALF_OPEN';
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    if (this.failureCount >= this.threshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.timeout;
    }
  }
}

// Usage
const apnsCircuitBreaker = new CircuitBreaker(5, 60000); // Open after 5 failures, retry after 1 min
try {
  await apnsCircuitBreaker.execute(() => apns.send(notification));
} catch (error) {
  // Fallback: Queue for retry or use alternative provider
  await retryQueue.add(notification);
}
Database Design for Notification Systems at Scale
A notification system design must account for high write volumes and fast reads. Partitioning, caching, and tiered storage keep the database responsive as volume grows.
Notification Storage Strategy:
Hot Storage (Redis):
- Recent notifications (last 24 hours)
- Unread notifications
- Fast retrieval for in-app notification feed
Warm Storage (PostgreSQL):
- Last 90 days of notifications
- Indexed for user queries
- Full search capabilities
Cold Storage (S3/Glacier):
- Historical data (90+ days)
- Compliance and audit trails
- Compressed and archived
Optimized Database Schema:
-- Partition by created_at for efficient queries
CREATE TABLE notifications (
  id BIGSERIAL,
  user_id VARCHAR(255) NOT NULL,
  notification_type VARCHAR(50) NOT NULL,
  channels JSONB NOT NULL,
  content JSONB NOT NULL,
  read_at TIMESTAMP,
  created_at TIMESTAMP NOT NULL DEFAULT NOW(),
  PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE notifications_2025_11 PARTITION OF notifications
  FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');
CREATE TABLE notifications_2025_12 PARTITION OF notifications
  FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');

-- Indexes for common queries
CREATE INDEX idx_user_created ON notifications(user_id, created_at DESC);
CREATE INDEX idx_user_unread ON notifications(user_id, created_at DESC)
  WHERE read_at IS NULL;

-- Delivery status tracking (separate table for better performance)
CREATE TABLE notification_delivery_status (
  id BIGSERIAL PRIMARY KEY,
  notification_id BIGINT NOT NULL,
  channel VARCHAR(50) NOT NULL,
  status VARCHAR(50) NOT NULL, -- sent, delivered, opened, clicked, failed
  provider_id VARCHAR(255), -- External provider's tracking ID
  delivered_at TIMESTAMP,
  opened_at TIMESTAMP,
  clicked_at TIMESTAMP,
  failed_at TIMESTAMP,
  failure_reason TEXT,
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_notification_channel ON notification_delivery_status(notification_id, channel);
CREATE INDEX idx_status_created ON notification_delivery_status(status, created_at DESC);
Query Optimization:
-- Bad: Unbounded query; reads the user's entire history across every partition
SELECT * FROM notifications WHERE user_id = 'user_123' ORDER BY created_at DESC;
-- Good: Uses index
SELECT * FROM notifications
WHERE user_id = 'user_123'
AND created_at >= NOW() - INTERVAL '90 days'
ORDER BY created_at DESC
LIMIT 50;
-- Better: Fetch unread count separately
SELECT COUNT(*) FROM notifications
WHERE user_id = 'user_123'
AND read_at IS NULL
AND created_at >= NOW() - INTERVAL '30 days';
Caching Strategy (Redis):
async function getUserNotifications(userId, limit = 50) {
  const cacheKey = `notifications:${userId}:recent`;

  // Try cache first
  const cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached);
  }

  // Cache miss - query database
  // Use a Date, not a raw millisecond number, so it compares correctly with the TIMESTAMP column
  const since = new Date(Date.now() - 86400000); // Last 24 hours
  const notifications = await db.notifications
    .where({user_id: userId})
    .where('created_at', '>=', since)
    .orderBy('created_at', 'desc')
    .limit(limit);

  // Cache for 5 minutes
  await redis.setex(cacheKey, 300, JSON.stringify(notifications));
  return notifications;
}
Monitoring & Observability for Notification Systems
No notification system design is complete without monitoring. Track these metrics to keep the system healthy:
Delivery Metrics:
- Delivery Success Rate: Target >99.5%.
- Time to Deliver:
  - Push: P95 < 5 seconds.
  - Email: P95 < 30 seconds.
  - SMS: P95 < 10 seconds.
- Retry Rate: Alert if >5% need retries.
- DLQ Size: Alert if >100 items in the dead letter queue.
Infrastructure Metrics:
- Queue Depth: Alert if >10,000 items pending.
- Consumer Lag: Time between enqueue and dequeue.
- DB Query Time: P95 < 100ms.
- Cache Hit Rate: Target >90%.
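The targets above are percentile and ratio thresholds, so checking them comes down to a little arithmetic. Here is a minimal sketch, assuming the raw durations and counters for one channel are already in memory (in production a metrics backend such as Prometheus would compute these). The names `percentile` and `checkHealth` and the shape of the input object are hypothetical and used only for illustration.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples, p) {
  if (samples.length === 0) return null;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// P95 targets in seconds, copied from the delivery metrics list.
const P95_TARGET_SECONDS = { push: 5, email: 30, sms: 10 };

// Returns human-readable alerts for one channel's window of data.
function checkHealth({ channel, durations, sent, failed, retried, queueDepth, dlqSize }) {
  const alerts = [];
  const p95 = percentile(durations, 95);
  if (p95 !== null && p95 >= P95_TARGET_SECONDS[channel]) {
    alerts.push(`${channel} P95 is ${p95}s (target < ${P95_TARGET_SECONDS[channel]}s)`);
  }
  if (sent > 0 && (sent - failed) / sent < 0.995) alerts.push('delivery success rate below 99.5%');
  if (sent > 0 && retried / sent > 0.05) alerts.push('more than 5% of sends needed retries');
  if (queueDepth > 10000) alerts.push('queue depth above 10,000');
  if (dlqSize > 100) alerts.push('dead letter queue above 100 items');
  return alerts;
}

console.log(checkHealth({
  channel: 'push',
  durations: [1, 1, 1, 1, 1, 1, 1, 1, 1, 9],
  sent: 1000, failed: 10, retried: 10, queueDepth: 500, dlqSize: 0,
}));
```

The sample input trips two alerts: the slow outlier pushes the P95 over the push target, and a 99% success rate is below the 99.5% goal.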
Example Monitoring Dashboard (Prometheus + Grafana):
// Instrument notification service
const prometheusClient = require('prom-client');
const notificationsSent = new prometheusClient.Counter({
  name: 'notifications_sent_total',
  help: 'Total notifications sent',
  labelNames: ['channel', 'status']
});
const notificationDeliveryDuration = new prometheusClient.Histogram({
  name: 'notification_delivery_duration_seconds',
  help: 'Time to deliver notification',
  labelNames: ['channel'],
  // Keep one bucket above the 30s alert threshold so slow deliveries stay measurable
  buckets: [0.1, 0.5, 1, 2, 5, 10, 30, 60]
});
// channelService is assumed to map channel names to provider clients
async function sendNotification(notification, channel) {
  const timer = notificationDeliveryDuration.startTimer({channel});
  try {
    await channelService[channel].send(notification);
    notificationsSent.inc({channel, status: 'success'});
  } catch (error) {
    notificationsSent.inc({channel, status: 'failure'});
    throw error;
  } finally {
    timer(); // Records the elapsed time in the histogram
  }
}
Alert Examples:
# Prometheus alerting rules
groups:
  - name: notifications
    rules:
      - alert: HighFailureRate
        # Ratio of failed sends to all sends, not failures per second
        expr: |
          sum(rate(notifications_sent_total{status="failure"}[5m]))
            / sum(rate(notifications_sent_total[5m])) > 0.05
        annotations:
          summary: "Notification failure rate above 5%"
      - alert: QueueBacklog
        expr: notification_queue_depth > 10000
        annotations:
          summary: "Notification queue has {{ $value }} pending notifications"
      - alert: SlowDelivery
        # histogram_quantile needs the _bucket series and caps at the highest finite bucket, hence >=
        expr: histogram_quantile(0.95, sum by (le) (rate(notification_delivery_duration_seconds_bucket[5m]))) >= 30
        annotations:
          summary: "P95 delivery time is {{ $value }}s (threshold: 30s)"
Managing all this infrastructure is complex. A notification infrastructure platform gives you built-in monitoring, a 99.99% uptime SLA, auto-scaling, and analytics dashboards. Focus on features, not infrastructure.
System Design Interview Preparation
Notification system design is a common interview topic. Here is how to approach it and what interviewers look for.
Common Interview Prompt
"Design a notification system that can send 10 million push notifications, 1 million SMS, and 5 million emails per day. The system should support multiple notification types and ensure reliable delivery."
Step 1: Clarify Requirements
Start by asking about functional and non-functional requirements.
Functional Requirements
Ask these questions to scope the problem:
- What notification types do we need to support?
  - Transactional (payments, security).
  - Promotional (campaigns, feature news).
  - System (downtime, maintenance).
- What delivery channels?
  - Push (iOS, Android, web).
  - Email.
  - SMS.
  - In-app.
- Do we need user preferences?
  - Opt-out methods.
  - Channel choices (email-only, push-only).
  - Quiet hours.
  - Rate limits.
- Do we need notification history?
  - User can view past notifications.
  - Search and filter features.
  - Read/unread status.
- Any special features?
  - Scheduled sends.
  - Batch sends (digest emails).
  - Priority levels (urgent, normal, low).
  - Templates and personalization.
Non-Functional Requirements
- Scale:
  - Daily volume: 10M push, 5M email, 1M SMS.
  - Peak traffic: Assume 5x average during peak hours.
  - Calculation: 10M push/day ÷ 86,400 seconds ≈ 116/sec average, ≈ 579/sec peak. All channels combined: 16M/day ≈ 185/sec average, ≈ 926/sec peak.
- Reliability:
  - At-least-once delivery: No notifications should be lost.
  - Duplicate prevention: Handle repeat sends gracefully.
  - SLA: 99.9% uptime, 99.5% delivery success rate.
- Latency:
  - Push notifications: Delivered within 5 seconds.
  - Emails: Delivered within 30 seconds.
  - SMS: Delivered within 10 seconds.
- Availability:
  - Handle partial failures (e.g., email is down but push still works).
  - Degrade smoothly during high load.
- Cost:
  - SMS is expensive (~$0.01 per message = $10k/day for 1M SMS).
  - Optimize delivery to reduce costs.
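The scale and cost figures above are quick back-of-the-envelope arithmetic, which is worth being able to reproduce on a whiteboard. Here is a small sketch of that calculation. The 5x peak factor and the one-cent SMS price are the assumptions stated in the list, the `estimate` helper is a hypothetical name, and results are rounded to the nearest whole number.

```javascript
const SECONDS_PER_DAY = 24 * 60 * 60; // 86,400

// Daily volumes from the interview prompt.
const dailyVolume = { push: 10_000_000, email: 5_000_000, sms: 1_000_000 };
const PEAK_FACTOR = 5;    // assumed peak-to-average ratio
const SMS_COST_CENTS = 1; // assumed price per SMS (~$0.01)

// Converts a daily volume into average and peak messages per second.
function estimate(volumePerDay, peakFactor = PEAK_FACTOR) {
  const average = volumePerDay / SECONDS_PER_DAY;
  return {
    averagePerSec: Math.round(average),
    peakPerSec: Math.round(average * peakFactor),
  };
}

const perChannel = Object.fromEntries(
  Object.entries(dailyVolume).map(([channel, volume]) => [channel, estimate(volume)])
);
const total = estimate(Object.values(dailyVolume).reduce((sum, v) => sum + v, 0));
const smsCostPerDayUsd = (dailyVolume.sms * SMS_COST_CENTS) / 100;

console.log(perChannel.push);  // { averagePerSec: 116, peakPerSec: 579 }
console.log(total);            // { averagePerSec: 185, peakPerSec: 926 }
console.log(smsCostPerDayUsd); // 10000
```

Even at peak, the combined load stays under 1,000 messages per second, so the hard parts of this design are reliability and provider limits rather than raw throughput.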
Step 2: High-Level Design
Draw a simple architecture diagram and walk through the flow.
                    +-------------+
                    | Client App  |
                    | (Triggers)  |
                    +------+------+
                           |
                           v
                 +------------------+      +-------------+
                 |   API Gateway    |----->|    Redis    |  (Rate Limiting)
                 | (Load Balancer)  |      +-------------+
                 +---------+--------+
                           |
                           v
                 +------------------+      +-------------+
                 |   Notification   |----->| PostgreSQL  |  (User Prefs, History)
                 |     Service      |      +-------------+
                 +---------+--------+
                           |
                           v
                 +------------------+
                 |  Message Queue   |  (Kafka / RabbitMQ / SQS)
                 |  (Partitioned)   |
                 +---------+--------+
                           |
     +-------------+-------+-----+-------------+
     |             |             |             |
     v             v             v             v
+----------+  +----------+  +----------+  +----------+
|   Push   |  |  Email   |  |   SMS    |  |  In-App  |
| Processor|  | Processor|  | Processor|  | Processor|
+----+-----+  +----+-----+  +----+-----+  +----+-----+
     |             |             |             |
     v             v             v             v
+----------+  +----------+  +----------+  +----------+
|   APNs   |  | SendGrid |  |  Twilio  |  | WebSocket|
|   FCM    |  | AWS SES  |  |          |  |  Server  |
+----------+  +----------+  +----------+  +----------+
Walk through the flow:
- Client sends a notification via API.
- API Gateway checks auth and rate limits.
- Notification Service runs logic and gets user preferences.
- Message goes to the right queue (push, email, SMS, in-app).
- Processors consume from queues.
- Providers deliver notifications.
- Status is tracked in the database.
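The middle of that flow (preference lookup, per-channel enqueue, and processors draining the queues) can be sketched in a few lines. This is an illustration under loud assumptions: plain arrays stand in for Kafka, RabbitMQ, or SQS, the preference store is a hard-coded `Map`, and `routeNotification` and `drain` are hypothetical names rather than part of any real API.

```javascript
// In-memory stand-ins for the per-channel queues.
const queues = new Map([['push', []], ['email', []], ['sms', []], ['in_app', []]]);

// Hypothetical preference store keyed by user ID.
const preferences = new Map([
  ['user_456', { optedOut: false, channels: ['push', 'email'] }],
  ['user_789', { optedOut: true, channels: ['push'] }],
]);

// Notification Service: apply preferences, then enqueue one message per allowed channel.
function routeNotification(notification) {
  const prefs = preferences.get(notification.user_id);
  if (!prefs || prefs.optedOut) return [];
  const targets = notification.channels.filter(
    (channel) => queues.has(channel) && prefs.channels.includes(channel)
  );
  for (const channel of targets) {
    queues.get(channel).push({ ...notification, channel, status: 'queued' });
  }
  return targets;
}

// Processor: drain one queue, hand each message to a provider client, record the status.
function drain(channel, deliver) {
  const results = [];
  const queue = queues.get(channel);
  while (queue.length > 0) {
    const message = queue.shift();
    deliver(message);
    results.push({ ...message, status: 'sent' });
  }
  return results;
}

const routed = routeNotification({
  user_id: 'user_456',
  title: 'Payment received',
  channels: ['push', 'sms', 'email'],
});
console.log(routed); // [ 'push', 'email' ]
```

SMS is dropped here because the user has not enabled it, which is the point of consulting preferences before anything reaches a queue.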
Step 3: Deep Dive on Notification System Design
Topic 1: Why Use Message Queues in a Notification System?
Message queues are one of the most important building blocks in notification system design. They decouple creation from delivery and keep the system reliable under load.
Interviewer asks: "Why not send notifications right in the API call?"
Your answer:
- Async Processing: API returns fast (< 100ms). The work happens in the background.
- Traffic Buffering: Handle spikes without overloading other services.
- Reliability: Messages stay in the queue until processed.
- Scalability: Add more workers to go faster.
- Retry Logic: Failed messages stay in the queue for retries.
Example:
Synchronous:
API call -> Send push -> Send email -> Send SMS -> Return response
Total time: 500ms + 1000ms + 800ms = 2.3 seconds (bad UX)
Asynchronous:
API call -> Enqueue notification -> Return 202 Accepted
Total time: 50ms (good UX)
Background workers handle actual delivery
Topic 2: How to Prevent Duplicate Notifications in Your System Design?
Deduplication is a core reliability concern in notification system design. Without it, users receive the same message twice and lose trust in the product.
Interviewer asks: "What if the same API call is made twice by accident?"
Your answer:
- Unique Keys: The client sends a unique ID. The server checks for duplicates.
POST /api/notifications
{
"idempotency_key": "payment_123_notification",
"user_id": "user_456",
...
}
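// Server logic
async function createNotification(request) {
  const key = `idempotency:${request.idempotency_key}`;
  // Claim the key atomically. SET with NX returns null if the key already exists,
  // so two concurrent requests cannot both pass the check.
  // (Argument style assumes an ioredis-like client.)
  const claimed = await redis.set(key, JSON.stringify({status: 'pending'}), 'EX', 86400, 'NX');
  if (!claimed) {
    return JSON.parse(await redis.get(key)); // Return cached (or pending) response
  }
  try {
    const notification = await sendNotification(request);
    // Cache the real response for 24 hours
    await redis.setex(key, 86400, JSON.stringify(notification));
    return notification;
  } catch (error) {
    await redis.del(key); // Release the key so the client can retry
    throw error;
  }
}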
- Queue Dedup: Use message ID to stop double processing.
- DB Constraints: Unique constraint on (user_id, type, external_id).
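Queue-level dedup matters because at-least-once delivery means a worker can receive the same message twice. A minimal sketch of a consumer that remembers message IDs for a limited time follows. The `Deduplicator` class and `processOnce` helper are hypothetical, and the in-process `Map` only works for a single worker; with several workers the seen-set would need to live in a shared store such as Redis or behind the unique constraint mentioned above.

```javascript
// Remembers message IDs for ttlMs so the set does not grow forever.
class Deduplicator {
  constructor(ttlMs, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, handy for tests
    this.seen = new Map(); // messageId -> expiry timestamp
  }

  // Returns true the first time an ID is seen within the TTL, false for repeats.
  firstSeen(messageId) {
    const current = this.now();
    // Linear sweep of expired entries; fine for a sketch, too slow for large sets.
    for (const [id, expiry] of this.seen) {
      if (expiry <= current) this.seen.delete(id);
    }
    if (this.seen.has(messageId)) return false;
    this.seen.set(messageId, current + this.ttlMs);
    return true;
  }
}

// Runs the handler only for messages that have not been processed recently.
function processOnce(dedup, message, handler) {
  if (!dedup.firstSeen(message.id)) return 'skipped';
  handler(message);
  return 'processed';
}

const dedup = new Deduplicator(24 * 60 * 60 * 1000);
const sent = [];
console.log(processOnce(dedup, { id: 'msg_1' }, (m) => sent.push(m.id))); // processed
console.log(processOnce(dedup, { id: 'msg_1' }, (m) => sent.push(m.id))); // skipped
```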
Topic 3: Handling Delivery Failures in a Notification System
Resilience to third-party outages is what separates a toy notification system from a production-grade design. Your architecture must assume that providers will go down.
Interviewer asks: "What if APNs is down? How do we avoid losing notifications?"
Your answer:
- Retry with Exponential Backoff:
  - 1st retry: 1 minute.
  - 2nd retry: 5 minutes.
  - 3rd retry: 30 minutes.
  - After N retries: Move to Dead Letter Queue.
- Circuit Breaker:
  - Spot failures (e.g., 5 in a row).
  - Open (stop sending for 1 minute).
  - Half-open (try one request).
  - Close if it works.
- Fallback Providers:
  - SendGrid fails: use AWS SES.
  - APNs fails: there is no second route to iOS devices (AWS SNS itself delivers through APNs), so keep retrying from the queue or fall back to another channel such as email or in-app.
- Dead Letter Queue (DLQ):
  - Holds items after all retries fail.
  - Alerts the ops team.
  - Enables manual review.
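The retry schedule and the circuit breaker states above fit in a short sketch. It uses the numbers from the list (delays of 1, 5, and 30 minutes, 5 consecutive failures, 1 minute open). `nextRetryDelay` and `CircuitBreaker` are hypothetical names, and the breaker wraps a synchronous function to stay short; a real provider call would be asynchronous.

```javascript
// Retry delays from the list above; after the last one the message goes to the DLQ.
const RETRY_DELAYS_MS = [1, 5, 30].map((minutes) => minutes * 60 * 1000);

// Delay before the given retry (1-based), or null once retries are exhausted.
function nextRetryDelay(attempt) {
  if (attempt < 1 || attempt > RETRY_DELAYS_MS.length) return null;
  return RETRY_DELAYS_MS[attempt - 1];
}

class CircuitBreaker {
  constructor({ failureThreshold = 5, openMs = 60_000, now = () => Date.now() } = {}) {
    this.failureThreshold = failureThreshold;
    this.openMs = openMs;
    this.now = now; // injectable clock, handy for tests
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  get state() {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.openMs ? 'half-open' : 'open';
  }

  // Runs fn unless the circuit is open. In the half-open state one trial call goes through.
  call(fn) {
    if (this.state === 'open') throw new Error('circuit open');
    try {
      const result = fn();
      this.failures = 0;
      this.openedAt = null; // a success closes the circuit
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.openedAt !== null || this.failures >= this.failureThreshold) {
        this.openedAt = this.now(); // trip, or re-open after a failed trial
      }
      throw error;
    }
  }
}
```

A worker would call the provider through `breaker.call(...)`, and on failure use `nextRetryDelay(attempt)` to decide whether to re-enqueue the message with a delay or move it to the dead letter queue.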
Topic 4: Database Query Optimization for Notification Systems
At scale, a notification system can store billions of rows. How you index and query that data determines whether the user-facing inbox feels instant or sluggish.
Interviewer asks: "A user has 100,000 notifications. How do we query them fast?"
Your answer:
- Indexing:
CREATE INDEX idx_user_created ON notifications(user_id, created_at DESC);
- Pagination (Cursor-Based):
SELECT * FROM notifications
WHERE user_id = 'user_123'
AND created_at < '2025-11-20 10:00:00' -- Cursor
ORDER BY created_at DESC
LIMIT 50;
- Partitioning:
-- Monthly partitions
CREATE TABLE notifications_2025_11 PARTITION OF notifications
FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');
- Caching:
  - Cache the last 24 hours in Redis.
  - Cache unread count.
  - Clear cache on new notification.
- Archival:
  - Move notifications older than 90 days to cold storage (S3).
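The caching bullets hinge on invalidation: a cached unread count is only useful if every write that changes it also clears it. A minimal sketch of that pattern follows, with a `Map` standing in for Redis and an array standing in for the notifications table. All names here are hypothetical, and `dbReads` exists only to make cache hits visible.

```javascript
const cache = new Map(); // stand-in for Redis
const table = [];        // stand-in for the notifications table
let dbReads = 0;         // counts how often the "database" is queried

const unreadKey = (userId) => `notifications:${userId}:unread_count`;

// Cache-aside read: serve from the cache, fall back to the table on a miss.
function getUnreadCount(userId) {
  const key = unreadKey(userId);
  if (!cache.has(key)) {
    dbReads += 1;
    const count = table.filter((n) => n.user_id === userId && n.read_at === null).length;
    cache.set(key, count);
  }
  return cache.get(key);
}

// Every write that changes the count clears the cached value.
function addNotification(notification) {
  table.push({ ...notification, read_at: null });
  cache.delete(unreadKey(notification.user_id));
}

function markRead(userId, id) {
  const row = table.find((n) => n.user_id === userId && n.id === id);
  if (row && row.read_at === null) {
    row.read_at = new Date();
    cache.delete(unreadKey(userId));
  }
}

addNotification({ id: 1, user_id: 'user_123' });
addNotification({ id: 2, user_id: 'user_123' });
console.log(getUnreadCount('user_123')); // 2 (cache miss, reads the table)
console.log(getUnreadCount('user_123')); // 2 (cache hit)
```

Deleting the key on write is simpler than updating it in place and avoids counts drifting when two writes race.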
Topic 5: Time Zone Handling in Notification System Design
Sending notifications at the right local time is a common requirement. A well-designed notification system stores each user's timezone and schedules delivery accordingly.
Interviewer asks: "How do we send an email at 9 AM in the user's local time?"
Your answer:
- Store the user timezone in the database.
- Calculate delivery time:
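// Assumes the moment-timezone package
const moment = require('moment-timezone');
// async is required because the function awaits the user lookup
async function getDeliveryTime(userId, targetHour) {
  const user = await getUser(userId);
  const userTimezone = user.timezone; // e.g., 'America/New_York'
  const now = moment().tz(userTimezone);
  // Target hour today in the user's zone, with minutes, seconds, and ms zeroed
  const deliveryTime = now.clone().hour(targetHour).startOf('hour');
  if (deliveryTime.isBefore(now)) {
    deliveryTime.add(1, 'day'); // Schedule for tomorrow
  }
  return deliveryTime.toDate();
}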
- Use a scheduled notifications queue.
- Batch by timezone to reduce queue operations.
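Batching by timezone means the scheduler tracks a few hundred zones instead of millions of per-user timers. A rough sketch follows, assuming a job that wakes up at the top of every hour and asks which zones have just reached the target hour. It uses only the built-in `Intl` API; `batchByTimezone`, `currentHourIn`, and `zonesDue` are hypothetical names, and zones with 30- or 45-minute offsets would need a finer tick than hourly.

```javascript
// Groups user IDs by IANA timezone so one scheduled job can cover each zone.
function batchByTimezone(users) {
  const batches = new Map();
  for (const user of users) {
    if (!batches.has(user.timezone)) batches.set(user.timezone, []);
    batches.get(user.timezone).push(user.id);
  }
  return batches;
}

// Local hour (0-23) in the given zone at the given instant.
function currentHourIn(timezone, date = new Date()) {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone: timezone,
    hour: 'numeric',
    hourCycle: 'h23',
  }).formatToParts(date);
  // The modulo guards against runtimes that render midnight as 24.
  return Number(parts.find((part) => part.type === 'hour').value) % 24;
}

// Zones whose local hour equals the target hour at the given instant.
function zonesDue(batches, targetHour, date = new Date()) {
  return [...batches.keys()].filter((tz) => currentHourIn(tz, date) === targetHour);
}

const batches = batchByTimezone([
  { id: 'user_1', timezone: 'America/New_York' },
  { id: 'user_2', timezone: 'Asia/Tokyo' },
  { id: 'user_3', timezone: 'America/New_York' },
]);
// 14:00 UTC in January is 9 AM in New York and 11 PM in Tokyo.
const tick = new Date('2025-01-15T14:00:00Z');
console.log(zonesDue(batches, 9, tick)); // [ 'America/New_York' ]
```

Each due zone then becomes a single queue operation that fans out to the user IDs in its batch.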
Step 4: Notification System Design Bottlenecks & Trade-offs
Every notification system design has bottlenecks. The key is to identify them early and discuss trade-offs clearly:
Database:
- Problem: One database cannot handle 100k writes/sec.
- Fix: Shard by user_id. Use Cassandra or DynamoDB for high write speed. If you use DynamoDB, you can also leverage DynamoDB Streams to trigger real-time notifications directly from data changes.
Message Queue:
- Problem: One Kafka partition handles ~10k messages/sec.
- Fix: Use 10 partitions by user_id (= 100k messages/sec).
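Partitioning by user_id works because a stable hash always sends the same user to the same partition, which preserves per-user ordering while spreading load. The sketch below illustrates the idea with a 32-bit FNV-1a hash over UTF-16 code units. This is an illustrative choice, not what Kafka clients use internally, and `partitionFor` is a hypothetical helper.

```javascript
// 32-bit FNV-1a hash of a string (computed over UTF-16 code units).
function fnv1a(text) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force an unsigned result
}

// Maps a user ID to one of partitionCount partitions.
function partitionFor(userId, partitionCount = 10) {
  return fnv1a(userId) % partitionCount;
}

// Count how 1,000 hypothetical users spread across 10 partitions.
const counts = new Array(10).fill(0);
for (let i = 0; i < 1000; i++) {
  counts[partitionFor(`user_${i}`)] += 1;
}
console.log(counts);
```

Changing the partition count remaps most users, so it is worth picking a count with headroom up front.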
External Service Limits:
- Problem: Providers cap connections and per-connection throughput. As a ballpark, assume APNs allows ~2,000 connections at ~2,000 notifications per second each.
- Fix: Pool connections. Spread across servers.
Cost:
- Problem: SMS costs $10k/day at scale.
- Fix:
- Only send SMS for high-priority items.
- Fall back to push or email for the rest.
- Batch multiple updates into one SMS.
What Interviewers Look For in a Notification System Design
Good answers:
- Clarify requirements first.
- Discuss trade-offs (consistency vs. availability, cost vs. reliability).
- Think about scale from the start.
- Mention monitoring.
- Ask questions throughout.
Bad answers:
- Jump straight into code.
- Ignore scale.
- Skip failure cases.
- Over-build without understanding what is needed.
For teams that build notification systems for real, it goes far beyond interview questions. MagicBell's notification infrastructure solves these challenges with a simple API, built-in scale, multi-channel delivery, and full analytics. Spend your time on your product, not on notification plumbing.
Notification System Design: Key Takeaways
Notification system design is both an art and a science. The UX side needs clear info hierarchy and respect for user preferences. The backend needs message queues for reliability, patterns for clean code, horizontal scaling for performance, and monitoring for visibility.
User Experience:
- Context matters. Give users enough info to act without switching apps.
- Types differ. Transactional, system, and marketing notifications each need different handling.
- User control stops fatigue. Quiet hours, limits, and channel preferences are a must.
System Architecture:
- Message queues handle async work, traffic spikes, and reliability.
- Design patterns (Observer, Factory, Chain of Responsibility, Strategy) keep code modular.
- Horizontal scaling with stateless services handles growth.
- Rate limiting protects users from fatigue and systems from overload.
Scalability:
- Database partitioning and caching are a must at scale.
- Multi-tier storage (hot/warm/cold) balances cost and speed.
- Circuit breakers and retries ensure resilience.
- Monitoring gives visibility into system health.
The Build vs. Buy Decision
Building a production-ready notification system requires:
- Infrastructure: Queues, databases, caching, load balancing.
- Channels: APNs, FCM, SendGrid, Twilio, and more.
- Reliability: Retries, circuit breakers, dead letter queues.
- Monitoring: Dashboards, alerts, on-call rotation.
- Compliance: GDPR, CAN-SPAM, TCPA, data retention.
Rough estimate: 6-12 months for a 3-person team to build it, plus at least 1 full-time engineer for ongoing upkeep. MagicBell's pricing plans come in at a fraction of that cost.
MagicBell provides all of this infrastructure out-of-the-box:
- Multi-channel: Push, email, SMS, in-app, Slack, Teams, webhooks.
- Reliable: 99.99% uptime SLA, auto retries, at-least-once delivery.
- User preferences: Quiet hours, channel picks, rate limits.
- Real-time: WebSocket support for instant in-app updates.
- Analytics: Delivery, open, and click-through rates.
- Simple API: Send notifications with one API call.
With MagicBell, your team can ship notifications in hours, not months. Focus on your product, not plumbing.
Further Reading
- Building a Notification System in Ruby on Rails: Database Design
- Building a React Notification System
Want to see a Rails notification system implementation? Or try the MagicBell playground.
To see workflows in action, explore our templates for GitHub events (pull requests, issues, CI/CD) and Stripe events (payments, subscriptions, disputes).
