The challenge: Deliver 1B notifications (Push, Email, SMS) per day across the globe, which averages roughly 11,600 per second before any peak traffic, with less than 2 seconds of latency, while tolerating vendor failures.
A single `NotificationRequested` event is published to a Kafka topic. One worker pool per channel (Push, Email, SMS) consumes this same event, each in its own consumer group so that every channel sees every event. This lets us scale each channel independently: if SMS volume spikes, we add more SMS workers (up to the topic's partition count) without affecting Email delivery.
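The fan-out can be illustrated with a small in-memory stand-in for the topic. This is a sketch of the consumer-group idea only, not Kafka client code: the `Topic` class, the `channels` field on the event, and the `*-workers` group names are assumptions made for the example.

```python
from collections import defaultdict


class Topic:
    """In-memory stand-in for a Kafka topic: one append-only log, one read offset per consumer group."""

    def __init__(self):
        self.log = []
        self.offsets = defaultdict(int)  # group_id -> next index to read

    def publish(self, event):
        self.log.append(event)

    def poll(self, group_id):
        # Each group advances its own offset, so every group sees every event once.
        start = self.offsets[group_id]
        batch = self.log[start:]
        self.offsets[group_id] = len(self.log)
        return batch


topic = Topic()
# The "channels" field is a hypothetical way to mark which channels an event targets.
topic.publish({"type": "NotificationRequested", "notification_id": "n-1", "channels": ["email", "sms"]})
topic.publish({"type": "NotificationRequested", "notification_id": "n-2", "channels": ["push"]})


def run_worker(channel):
    """Read new events as this channel's consumer group and keep only the relevant ones."""
    handled = []
    for event in topic.poll(group_id=f"{channel}-workers"):
        if channel in event["channels"]:
            handled.append(event["notification_id"])
    return handled


email_handled = run_worker("email")
sms_handled = run_worker("sms")
push_handled = run_worker("push")
```

Because each group tracks its own offset, a slow SMS pool falls behind on its own offset without delaying the Email pool.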
External vendors (SendGrid, Twilio) will fail or throttle us, so we wrap vendor calls in **Polly Circuit Breakers**. If Twilio fails 5 times in a row, the circuit opens: we stop calling Twilio for 60 seconds and automatically route "Critical" SMS through a backup vendor (e.g., AWS SNS). This keeps 2FA codes flowing to users even while the primary vendor is down.
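A minimal sketch of the breaker-plus-fallback behavior follows. It is hand-rolled to show the state machine and is not Polly's API; `CircuitBreaker`, `VendorError`, `send_sms`, and the `"retry"` / `"deferred"` return values are hypothetical names, and the clock is injectable only so the 60-second break can be simulated.

```python
import time


class VendorError(Exception):
    """Hypothetical error raised when a vendor call fails or is throttled."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, break_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.break_seconds = break_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.break_seconds:
            # Break period is over: allow a trial call; one more failure reopens the circuit.
            self.opened_at = None
            self.consecutive_failures = self.failure_threshold - 1
            return False
        return True

    def call(self, fn, *args):
        try:
            result = fn(*args)
        except VendorError:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.consecutive_failures = 0
        return result


def send_sms(breaker, primary, backup, message, critical):
    if breaker.is_open():
        # Primary is being skipped: only critical traffic goes to the backup vendor.
        return backup(message) if critical else "deferred"
    try:
        return breaker.call(primary, message)
    except VendorError:
        return "retry"  # the caller re-enqueues the message


now = [0.0]  # fake clock so the example does not sleep
breaker = CircuitBreaker(clock=lambda: now[0])


def failing_primary(message):
    raise VendorError("primary vendor unavailable")


def backup(message):
    return f"backup:{message}"


# Five failures open the circuit; the sixth critical message goes to the backup.
results = [send_sms(breaker, failing_primary, backup, "code 123456", critical=True) for _ in range(6)]
```

In this sketch, non-critical SMS is deferred rather than sent through the backup while the circuit is open; whether to drop, delay, or reroute that traffic is a product decision the text above leaves open.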
Not all notifications are equal. 'Password Reset' is high priority; 'Monthly Newsletter' is low. We use a separate queue per priority. When the system is under heavy load, the Newsletter queue is throttled so that critical system messages always have enough CPU and network bandwidth to stay within the latency target.
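One way to picture the throttling is a dispatcher that drains the high-priority queue first and caps the low-priority share of each send window when load is heavy. This is a simplified sketch under assumptions: `dispatch_tick`, the per-tick `capacity`, and the 25% low-priority share are illustrative choices, not values from the design above.

```python
from collections import deque


def dispatch_tick(high, low, capacity, under_heavy_load, low_share_when_loaded=0.25):
    """Pick up to `capacity` messages to send in one tick, high priority first."""
    batch = []
    while high and len(batch) < capacity:
        batch.append(high.popleft())

    remaining = capacity - len(batch)
    if under_heavy_load:
        # Throttle: low-priority traffic gets at most a fixed share of the tick.
        low_budget = min(remaining, int(capacity * low_share_when_loaded))
    else:
        low_budget = remaining

    for _ in range(min(low_budget, len(low))):
        batch.append(low.popleft())
    return batch


high = deque(["reset-1", "reset-2"])
low = deque(["news-1", "news-2", "news-3", "news-4", "news-5"])

loaded_batch = dispatch_tick(high, low, capacity=4, under_heavy_load=True)
normal_batch = dispatch_tick(high, low, capacity=4, under_heavy_load=False)
```

Under load the tick deliberately leaves capacity unused rather than filling it with newsletters, which keeps headroom for critical messages that arrive mid-tick.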
Q: "How do you prevent sending the same notification twice if a vendor is slow?"
Architect Answer: "**External Idempotency**. We generate a `NotificationId` once per notification and pass it to the vendor as an idempotency key on every attempt. Stripe's API is the best-known example of this pattern; support among messaging vendors varies, so we confirm it in each vendor's API documentation. If we retry the call because of a timeout, the vendor sees the same key, recognizes the duplicate, and does not send the user a second identical email."
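The retry flow in that answer can be sketched against a fake vendor. Everything here is an assumption made for illustration: `FakeVendor` models a vendor that honors idempotency keys and loses its first response, and `send_with_retry` is a hypothetical helper, not any real vendor SDK call.

```python
import uuid


class FakeVendor:
    """Stand-in for a vendor API that deduplicates on an idempotency key."""

    def __init__(self, timeouts_before_success=1):
        self.seen = {}       # idempotency key -> stored response
        self.delivered = []  # payloads actually sent to the user
        self.timeouts_left = timeouts_before_success

    def send(self, idempotency_key, payload):
        if idempotency_key not in self.seen:
            # First time this key is seen: deliver and remember the response.
            self.delivered.append(payload)
            self.seen[idempotency_key] = {"status": "accepted", "key": idempotency_key}
        if self.timeouts_left > 0:
            # Simulate a slow vendor: the work was done but the response never arrived.
            self.timeouts_left -= 1
            raise TimeoutError("response lost after the vendor processed the request")
        return self.seen[idempotency_key]


def send_with_retry(vendor, notification_id, payload, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            # The same key is reused on every attempt; a fresh key would defeat deduplication.
            return vendor.send(idempotency_key=notification_id, payload=payload)
        except TimeoutError:
            if attempt == max_attempts:
                raise


vendor = FakeVendor()
notification_id = str(uuid.uuid4())  # generated once, when the notification is requested
response = send_with_retry(vendor, notification_id, {"to": "user@example.com", "subject": "Reset"})
```

The first attempt times out after the vendor has already delivered, the retry carries the same key, and the vendor returns the stored response instead of delivering again.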